CONSISTENT NONPARAMETRIC ESTIMATION
FOR HEAVY-TAILED SPARSE GRAPHS 


CHRISTIAN BORGS, JENNIFER T. CHAYES, HENRY COHN, AND SHIRSHENDU GANGULY 

Abstract. We study graphons as a non-parametric generalization of stochastic block models, 
and show how to obtain compactly represented estimators for sparse networks in this framework. 
Our algorithms and analysis go beyond previous work in several ways. First, we relax the usual 
boundedness assumption for the generating graphon and instead treat arbitrary integrable graphons, 
so that we can handle networks with long tails in their degree distributions. Second, again motivated 
by real-world applications, we relax the usual assumption that the graphon is defined on the unit 
interval, to allow latent position graphs where the latent positions live in a more general space, and 
we characterize identifiability for these graphons and their underlying position spaces. 

We analyze three algorithms. The first is a least squares algorithm, which gives an approximation 
we prove to be consistent for all square-integrable graphons, with errors expressed in terms of the 
best possible stochastic block model approximation to the generating graphon. Next, we analyze 
a generalization based on the cut norm, which works for any integrable graphon (not necessarily 
square-integrable). Finally, we show that clustering based on degrees works whenever the underlying 
degree distribution is atomless. Unlike the previous two algorithms, this third one runs in polynomial 
time. 
1. Introduction

Motivated by numerous real-world technological, social, and biological networks, the study of 
large networks has become increasingly important. Much work in the statistics and machine learning 
communities has focused on the questions of modeling and estimation for these networks. 

1.1. Stochastic block models and W-random graphs. Many previous papers have described
these networks in terms of parametric models, one of the most popular being the stochastic block
model, introduced in [42]. These models can be characterized by a vector of probabilities p = (p_i)
on a finite set of communities and a matrix B = (β_ij) of “affinities.” Given these parameters, one
then generates a graph on n labeled nodes by assigning a community to each vertex, independently
at random according to the probability distribution p, and then connecting vertices belonging
to communities i and j with probability β_ij. Hence the stochastic block model for k groups is
determined by (k − 1) + k(k + 1)/2 parameters. Such a model is often considered a reasonable
approximation of a small social network characterized by a limited number of communities.
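The generative procedure just described can be sketched in a few lines. This is only an illustration: the function names `sample_sbm` and `num_parameters`, the seed handling, and the example parameters are assumptions introduced here, not part of the paper.

```python
import random

def sample_sbm(n, p, B, seed=0):
    """Sample a stochastic block model graph on n labeled nodes.

    p: community probabilities (length k, summing to 1).
    B: symmetric k x k matrix of affinities in [0, 1].
    Returns (labels, A), where A is a symmetric 0/1 adjacency matrix.
    """
    rng = random.Random(seed)
    # Assign each vertex a community, i.i.d. according to p.
    labels = rng.choices(range(len(p)), weights=p, k=n)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Connect i and j with probability B[community(i)][community(j)].
            if rng.random() < B[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1
    return labels, A

def num_parameters(k):
    """(k - 1) free community probabilities plus k(k + 1)/2 affinities."""
    return (k - 1) + k * (k + 1) // 2

# Hypothetical two-community example.
labels, A = sample_sbm(50, [0.3, 0.7], [[0.5, 0.1], [0.1, 0.4]])
```

For k = 2 the count is 1 + 3 = 4 parameters: one free community probability and the three distinct entries of the symmetric 2 × 2 affinity matrix.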

More recently, motivated by extremely large networks, researchers have begun to consider non- 
parametric stochastic block models, for which there is a continuous family of communities, i.e., 
for which the k × k matrix of edge probabilities is replaced by a two-dimensional function. The
non-parametric models we study in this paper are usually referred to as W-random graphs or
latent position graphs. In the most general setup, such a model is defined in terms of a probability
space (Ω, π)^1 (the space of latent positions or features) and a graphon W over (Ω, π), defined as an
integrable, non-negative function on Ω × Ω that is symmetric in the sense that W(x, y) = W(y, x)
for all x, y ∈ Ω. To generate a graph on n nodes, one then chooses n “positions” x_1, ..., x_n i.i.d. at
random from (Ω, π) and, conditioned on these, chooses edges independently, with the probability of
an edge between vertices i and j given by W(x_i, x_j). The resulting graph is called a W-random
graph.

As originally proposed in [41], the space of latent positions Ω comes equipped with a metric
and the probability of connection is a function of distance, but the more general setting we have
described is commonly studied. Note that in the dense setting, this model is quite natural, since it
can be shown [6,32,43] that if a random graph G is the restriction of (an ergodic component of) an
infinite, exchangeable random graph, then G must be an instance of a W-random graph for some
function W with values in [0,1]. Due to this connection, W-random graph models are often called
exchangeable graph models.

1.2. Dense and sparse graphs. To model sparse graphs in this non-parametric setup, one uses
connection probabilities given by a symmetric function W times a target density ρ, leading
to the model of “inhomogeneous random graphs” defined in [11], with nodes i and j now being
connected with probability min{1, ρW(x_i, x_j)}. For both dense and sparse graphs, this kind of
model is related to the theory of convergent graph sequences [12,15-19]. In the setting of dense
graph limits, W-random graphs were first explicitly proposed in [53], although they can be implicitly
traced back to the much earlier work of [43] and [6] mentioned above. The term graphon originated
in [18].
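A minimal sketch of sampling a sparse W-random graph at target density ρ, with edge probabilities min{1, ρW(x_i, x_j)}. The function name `sample_w_random_graph`, the `sample_position` callback, and the example graphon `W_example` (an unbounded but integrable product kernel on [0,1)) are assumptions chosen for illustration.

```python
import random

def sample_w_random_graph(n, W, sample_position, rho, seed=0):
    """Sample positions x_1..x_n i.i.d., then edges with prob. min{1, rho*W}."""
    rng = random.Random(seed)
    xs = [sample_position(rng) for _ in range(n)]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            q = min(1.0, rho * W(xs[i], xs[j]))  # truncate at probability 1
            if rng.random() < q:
                A[i][j] = A[j][i] = 1
    return xs, A

def W_example(x, y):
    """Hypothetical unbounded graphon g(x) g(y), g(t) = 0.5 (1 - t)^(-1/2)."""
    g = lambda t: 0.5 * (1.0 - t) ** -0.5
    return g(x) * g(y)

# Positions uniform on [0, 1); rng.random() never returns 1.0, so g is finite.
xs, A = sample_w_random_graph(300, W_example, lambda rng: rng.random(), rho=0.05)
```

The truncation at 1 only matters for pairs where ρW exceeds 1, which is exactly where unbounded graphons differ from bounded ones at finite ρ.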

While for dense graphs one only needs to consider bounded graphons (indeed, the results of [6,43]
imply that it is enough to consider graphons that take values in [0,1]), this boundedness assumption
is not very natural for sparse graphs. Indeed, it is not hard to see that for bounded graphons W,
all degrees in a W-random graph are of the same order (except in very sparse settings, where the
maximum degree might differ from the average degree by a logarithmic factor). While this is no
problem for dense graphs, since here the average degree is of the same order as the number of
vertices, and hence automatically of the same order as the maximal degree, it is a serious restriction
for sparse graphs. Indeed, many real-world networks have long-tailed degree distributions. For
applications, one would therefore want to consider unbounded graphons W.

1.3. Estimation and previous literature. How can we estimate a graphon W given a sample
G of a W-random graph? This problem encapsulates the idea of inferring the underlying structure
in a random network.

For the special case where W is a stochastic block model, the estimation problem is closely related
to the problem of graph partitioning and has been intensely studied in the literature [34,42,65],
using methods that range from maximum likelihood estimates [64] and Gibbs sampling [60] or
simulated annealing [45] to spectral clustering [13,25,30,31,55] and tensor algebra [8]. Proving
consistency of these methods is often not hard in the dense regime, but it becomes more difficult for
sparse graphs. See, for example, [50,51] for a proof of consistency for spectral clustering when the
average degree is as small as log n, and [2,3] for an effective algorithm that is provably consistent as
long as the average degree diverges.

^1 As usual, a full specification of the probability space (Ω, π) requires the specification of a σ-algebra ℱ in addition
to the underlying space Ω and measure π. We will discuss measure-theoretic technicalities only when they seem
important or could potentially cause confusion.

Estimating graphons that are not block models is more challenging. This problem is implicit 
in [46], but the first explicit discussion of the non-parametric problem we are aware of was given 
in [9], even though the actual consistency proof there is still limited to stochastic block models 
with a fixed number of blocks. The restriction to a fixed number of blocks was relaxed in [58] and 
[29]. The full non-parametric model was studied in [10], under the assumption that none of the 
eigenfunctions of the operator associated with the kernel W is orthogonal to the constant function 1 
and the eigenvalues are distinct. 

Many further papers have been written on graphon estimation, including [1,5,7,20-23,26-28, 
35,37-39,47,48,52,54,56-58,61,63,66-69]. Each paper makes different assumptions about the 
density and the underlying graphon. Strong results are known for dense graphs: [23] shows how to
approximate arbitrary measurable graphons W with values in [0,1] given a dense W-random graph,
and [37] attains an optimal rate for least squares estimators of both stochastic block models and
Hölder-continuous graphons from a dense graph. For sparse graphs, [66] proves convergence of a
maximum likelihood estimator under the assumption that W is bounded, bounded away from 0, and
Hölder-continuous. Most recently, [20] introduces a modified version of the least squares algorithm
that optimizes over block models with bounded L^∞ norm; this algorithm achieves consistency for
arbitrary bounded graphons and arbitrary densities, as long as the average degree diverges with the 
number of vertices. The same paper also gives a differentially private version of the least squares 
algorithm which works again for arbitrary bounded graphons, now requiring that the average degree 
grows at least like a logarithm of the number of vertices. Independently, [47] proposes and analyzes 
the modified (non-private) algorithm and proves matching upper and lower bounds for the rates 
achieved by this algorithm. 

But more important than some of the technical assumptions used by previous authors is the 
fact that all the previous results we are aware of require W to be bounded. As pointed out before, 
this assumption, while natural for dense graphs, rules out most degree distributions observed in 
real-world networks. Our goal here is to remove this assumption. 

1.4. Identifiability. Before summarizing our contributions, we need to discuss the fact that in
general, W cannot be uniquely determined from the observation of even the full sequence (G_n)_{n≥1},
a problem called the identifiability issue in the literature; see, for example, [9,22]. To discuss
this, consider two graphons W and W' over two probability spaces (Ω, π) and (Ω', π'), as well as a
measure-preserving map φ: (Ω, π) → (Ω', π'). Define the pullback of W' to Ω as the graphon (W')^φ
defined by (W')^φ(x, y) = W'(φ(x), φ(y)). It is not hard to see that then the sequences of random
graphs generated from W and W' have the same distribution if W = (W')^φ. While it was stated in
some of the early literature on graphon estimation that the converse is true as well, that turns out
to be false; see, for example, Example 2.7 below for a counterexample. To formulate the correct
statement, we define W and W' to be equivalent if there exists a third graphon U over a probability
space (Ω'', π'') such that W = U^φ and W' = U^ψ for two measure-preserving maps φ, ψ from Ω and
Ω' to Ω'', respectively; see Section 2.4 for more details.
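The pullback construction is easiest to see on finite spaces, where a graphon is just a block model. The sketch below uses exact rational arithmetic; the two spaces, the map `phi`, and the helper names are hypothetical choices made for illustration. It checks that `phi` is measure preserving and that the pulled-back block model has the same edge density as the original.

```python
from fractions import Fraction

# Block model W' on Omega' = {0, 1} with measure pi'.
pi_prime = [Fraction(1, 2), Fraction(1, 2)]
B_prime = [[Fraction(3, 4), Fraction(1, 4)],
           [Fraction(1, 4), Fraction(1, 2)]]

# Finer space Omega = {0, 1, 2} with measure pi, and a map phi: Omega -> Omega'.
pi = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
phi = [0, 0, 1]

def is_measure_preserving(pi, phi, pi_prime):
    """Check pi(phi^{-1}({a})) == pi'({a}) for every point a of Omega'."""
    pushed = [Fraction(0)] * len(pi_prime)
    for x, mass in enumerate(pi):
        pushed[phi[x]] += mass
    return pushed == list(pi_prime)

def pullback(B_prime, phi):
    """(W')^phi(x, y) = W'(phi(x), phi(y))."""
    m = len(phi)
    return [[B_prime[phi[x]][phi[y]] for y in range(m)] for x in range(m)]

def edge_density(pi, B):
    """Integral of the graphon with respect to pi x pi."""
    m = len(pi)
    return sum(pi[x] * pi[y] * B[x][y] for x in range(m) for y in range(m))

B = pullback(B_prime, phi)
```

Here the three-point model splits one community of W' into two indistinguishable halves, so both models generate the same random graphs even though they live on different spaces.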

With this definition, we are now ready to characterize the full extent to which W is not identifiable: 
The sequences generated from two graphons W and W' are identically distributed if and only if W 
and W' are equivalent. In the dense case, this was proved in [32] for the case where W and W' are 
defined over [0,1] equipped with the uniform distribution, and for the case of general probability 
spaces it follows from the results of [14] by a simple argument involving subgraph counts. But for 
the sparse case, and general integrable (rather than bounded) graphons, this is a new result, proved 
in this paper (Theorem 2.6 in Section 2.4). Thus, both the feature space (Ω, π) and the graphon W
are unobservable in general, and even if we fix the feature space there is no “canonical graphon” an 
estimation procedure can output. The best we can hope for is a representative from an equivalence 
class of graphons. 

In light of these facts, the natural way of dealing with the identification problem is to admit 
that there is nothing canonical about any particular representative W, and to define consistency as 
consistency with respect to a metric between equivalence classes, rather than between graphons 
themselves. The papers [66] and [47] follow this strategy, by using a variant of the L^2 metric which
is a metric over equivalence classes. Most other papers either avoid the identifiability problem
altogether, by redefining the problem as the problem of finding an approximation Ĥ to the matrix
H_n(W) = (W(x_i, x_j))_{1≤i,j≤n} (see, for example, [23] or [37]), or make additional assumptions
which guarantee the existence of a canonical representative, e.g., by postulating that W is defined
over the interval [0,1] and assuming that after a measure-preserving transformation, the “degrees”
W_x = ∫_0^1 W(x, y) dy are strictly monotone in x, in which case there is a canonical representative of
the graphon W.


1.5. Goals. In this paper, we follow the spirit of [66] and define consistency with respect to a metric
on equivalence classes of graphons, but in contrast to [66], we allow for more general spaces than
just the uniform distribution over the unit interval.^2 To define our notion of distance, we recall that
a coupling between two probability measures π, π' is a measure ν on the product space such that
the projections of ν to the two coordinates are equal to π and π', respectively. Given p ≥ 1 and
two L^p graphons W over (Ω, π) and W' over (Ω', π') (i.e., graphons such that ∫ |W|^p d(π × π) < ∞ and
∫ |W'|^p d(π' × π') < ∞), we then define the distance δ_p(W, W') by

(1.1)   \delta_p(W, W') = \inf_{\nu} \left( \int\!\!\int \bigl| W(x,y) - W'(x',y') \bigr|^p \, d\nu(x,x') \, d\nu(y,y') \right)^{1/p},

where the infimum is over all couplings ν of π and π'.

Having defined a metric on equivalence classes of graphons, we can now formulate the estimation
problem considered in this paper: Given a single instance of a W-random graph defined on an
unobserved probability space (Ω, π), find an algorithm that (a) outputs an estimator Ŵ such that
Ŵ has a concise representation whose size grows only slowly with n; (b) estimates W consistently
assuming just integrability conditions; (c) works for arbitrary target densities, as long as the graph
is not too sparse (say has divergent average degree); and (d) runs in polynomial time.

While efficiency (property (d)) is clearly important for practical applications, our main focus in 
this paper will be the fundamental problem of consistent estimation under as few restrictions on W 
as possible, i.e., algorithms achieving properties (a)-(c). Indeed, none of the three algorithms we 
study in this paper achieves all four properties. Two of them achieve (a)-(c), and hence solve the 
desired problem of consistent estimation, but do not run in polynomial time. The third achieves (a), 
(c), and (d), and hence is efficient, but requires an additional condition to ensure consistency. 


1.6. Summary of results. In this paper, our estimator Ŵ will be given in terms of a block model,
with a number of blocks that grows slowly with the number of vertices of the input graph. Given
this framework, it is natural to compare the performance of our algorithm to the best possible block
model in a suitable class of block models. Here we consider the class ℬ_{≥κ} = {(p, B): min_i p_i ≥ κ}
of all block models with minimal block size at least κ. For an approximation outputting a block
model in ℬ_{≥κ}, the best error we could achieve is

(1.2)   \varepsilon^{(p)}_{\ge\kappa}(W) = \inf_{W' \in \mathcal{B}_{\ge\kappa}} \delta_p(W, W').

^2 Note that from a purely measure-theoretic approach to W-random graphs, one can restrict oneself to graphons
over the unit interval without any loss of generality, since every integrable graphon W is equivalent to a graphon W'
defined over [0,1] equipped with the uniform distribution; see Theorem 2.9 in Section 2.4. However, when W is given
in an application, it is often a continuous function over a higher dimensional space, and while W' leads to the same
distribution of W-random graphs, the transformation from W to W' ruins continuity, which is often needed to prove
good approximation bounds. For applications, the general setup is therefore more natural.

We often refer to this benchmark as an oracle error, since it is the best an oracle with access to 
the unknown W could do. Our goal is to prove oracle inequalities that bound the estimation error 
in terms of the oracle error, as well as a few additional terms that account for variance and the 
visibility of heavy tails at finite scale. 

When establishing the estimation error for W, we usually first prove a bound on the estimation
error for the intermediate matrix Q_n = (min{1, ρW(x_i, x_j)})_{i,j∈[n]}, which will be expressed in
terms of an oracle error for Q_n plus a concentration error stemming from the fact that, even after
conditioning on Q_n, the observed graph G_n is random; see Theorem 3.2 and Theorem 4.2 below.
In a second step, we then prove consistency for the original estimation error, given bounds that 
estimate the difference between W and W. Note that part of the literature stops at the first step, 
effectively avoiding the identifiability issue discussed above.

In this paper, we consider three algorithms for producing a block model approximation to W
from a single instance of a W-random graph G: two inefficient ones and one whose running time is
polynomial in n.

(1) The well-known least squares algorithm, which has been analyzed under various additional
assumptions on W, until recently [20] not even covering arbitrary bounded graphons. Here
we will prove consistency of this algorithm in the metric δ_2 for arbitrary L^2 graphons.

(2) A least cut norm algorithm, which we prove to be consistent under the cut norm for arbitrary
L^1 graphons.

(3) A degree sorting algorithm, which we show is consistent whenever the degree distribution of
W is atomless. (Graphons with this property are equivalent to graphons over [0,1] such that
W_x = ∫_0^1 W(x, y) dy is strictly monotone in x.) This algorithm runs in polynomial time.

To state our results, we need a few definitions. As usual, [n] denotes the set {1, ..., n}. Given
an n × n matrix A, we use ||A||_p to denote its L^p norm, defined by ||A||_p^p = (1/n^2) Σ_{i,j} |A_ij|^p. Given a
graph G on [n], we use A(G) to denote the adjacency matrix of G, and ρ(G) = ||A(G)||_1 to denote
its density. We identify partitions of [n] into k classes (some of which can be empty) with maps
π: [n] → [k], where V_i = V_i(π) = π^{-1}({i}) is the i-th class of the partition. Given such a map and a
k × k matrix B, we will use B^π for the n × n matrix with entries (B^π)_{xy} = B_{π(x)π(y)}. Finally, for an n × n
matrix A, we use A_π to denote the matrix where for each (x, y) ∈ V_i × V_j, the matrix element A_xy
is replaced by the average over V_i × V_j, and A/π to denote the k × k matrix of block averages

    (A/\pi)_{ij} = \frac{1}{|V_i|\,|V_j|} \sum_{(u,v) \in V_i \times V_j} A_{uv},

defined to be 0 if either V_i or V_j is empty; note that the two are related by A_π = (A/π)^π.
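The block-average notation can be made concrete with a short sketch. The helper names `block_average` (for A/π), `blow_up` (for B^π), and `norm_p`, as well as the 4-vertex example, are introduced here for illustration; classes are indexed from 0 rather than 1.

```python
def block_average(A, pi, k):
    """A/pi: the k x k matrix of block averages (0 where a class is empty)."""
    n = len(A)
    sizes = [pi.count(i) for i in range(k)]
    sums = [[0.0] * k for _ in range(k)]
    for u in range(n):
        for v in range(n):
            sums[pi[u]][pi[v]] += A[u][v]
    return [[sums[i][j] / (sizes[i] * sizes[j]) if sizes[i] and sizes[j] else 0.0
             for j in range(k)] for i in range(k)]

def blow_up(B, pi):
    """B^pi: the n x n matrix with entries B[pi(x)][pi(y)]."""
    n = len(pi)
    return [[B[pi[x]][pi[y]] for y in range(n)] for x in range(n)]

def norm_p(A, p):
    """L^p norm with the 1/n^2 normalization used in the text."""
    n = len(A)
    return (sum(abs(a) ** p for row in A for a in row) / n ** 2) ** (1.0 / p)

# Path-like graph on 4 vertices, partition {0, 1}, {2, 3}, and one empty class.
A = [[0, 1, 1, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 0]]
pi = [0, 0, 1, 1]
A_over_pi = block_average(A, pi, 3)   # A/pi
A_pi = blow_up(A_over_pi, pi)         # A_pi = (A/pi)^pi
```

Averaging over blocks leaves the density unchanged: both `norm_p(A, 1)` and `norm_p(A_pi, 1)` equal 6/16 in this example.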

Throughout this paper, we will assume that the graph is sparse (in the sense that ρ → 0), but
that it has divergent average degree (i.e., we assume that nρ → ∞). Under these assumptions we
will prove the following results.

Least squares algorithm. Given an input graph G on n vertices and a parameter κ ∈ (0,1] such
that κn ≥ 1, let

(1.3)   (\hat\pi, \hat B) \in \operatorname*{argmin}_{\pi, B} \, \| A(G) - B^{\pi} \|_2,

where the optimization is over all k × k matrices B and all partitions π: [n] → [k] such that all
non-empty classes of π have size at least ⌊κn⌋, with k chosen such that it can accommodate all such
partitions (any k ≥ ⌊n/⌊κn⌋⌋ works, since at most that many disjoint classes of size at least ⌊κn⌋
fit into [n]). Setting p̂_i to be the relative size of the i-th partition class of π̂, i.e.,

    \hat p_i = \frac{1}{n} \, |V_i(\hat\pi)|,

the least squares algorithm then outputs the block model Ŵ = (p̂, B̂). Note that the above
minimization problem is slightly helped by the fact that we minimize the L^2 norm. For a given π,
the minimizer B can therefore be obtained by averaging A(G) over the classes of π, showing that B̂
is of the form A(G)/π̂. Nevertheless the algorithm is inefficient, since we still need to minimize over
partitions π: [n] → [k].
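To make the inefficiency tangible, here is a brute-force sketch of the least squares algorithm that enumerates all k^n maps π, so it is usable only for very small n. The function names and the choice k = ⌊n/⌊κn⌋⌋ are assumptions made for this illustration; for each π the optimal B is taken to be the block average, as noted above.

```python
import itertools
import math

def block_average(A, pi, k):
    """A/pi: k x k matrix of block averages, 0 where a class is empty."""
    n = len(A)
    sizes = [pi.count(i) for i in range(k)]
    sums = [[0.0] * k for _ in range(k)]
    for u in range(n):
        for v in range(n):
            sums[pi[u]][pi[v]] += A[u][v]
    return [[sums[i][j] / (sizes[i] * sizes[j]) if sizes[i] and sizes[j] else 0.0
             for j in range(k)] for i in range(k)]

def l2_distance(A, B, pi):
    """||A - B^pi||_2 with the 1/n^2 normalization."""
    n = len(A)
    total = sum((A[x][y] - B[pi[x]][pi[y]]) ** 2
                for x in range(n) for y in range(n))
    return math.sqrt(total / n ** 2)

def least_squares_block_model(A, kappa):
    n = len(A)
    min_size = math.floor(kappa * n)      # requires kappa * n >= 1
    k = n // min_size                     # assumed: enough labels for any admissible partition
    best = None
    for pi in itertools.product(range(k), repeat=n):
        sizes = [pi.count(i) for i in range(k)]
        if any(0 < s < min_size for s in sizes):
            continue                      # some non-empty class is too small
        B = block_average(A, pi, k)       # optimal B for this pi
        err = l2_distance(A, B, pi)
        if best is None or err < best[0]:
            best = (err, pi, B)
    err, pi, B = best
    p = [pi.count(i) / n for i in range(k)]
    return p, B, pi, err

# Two disjoint triangles on 6 vertices.
A = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
p_hat, B_hat, pi_hat, err = least_squares_block_model(A, kappa=0.5)
```

On this input the minimizer separates the two triangles, with within-class average 2/3 (the diagonal zeros are included in the average) and cross-class average 0.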

Our main result concerning this algorithm is that if G is a W-random graph at target density ρ
and W ∈ L^2, then the algorithm is consistent in the sense that the δ_2 distance between the suitably
normalized estimator Ŵ and the suitably normalized graphon W tends to 0 with probability one as
n → ∞, as long as κ → 0 and κ^{-2} log(1/κ) = o(nρ). If instead of almost sure convergence we
content ourselves with convergence in probability, then for κ ∈ (n^{-1}, 1] satisfying a growth
condition involving 1 + log(1/κ), we have an oracle inequality which, when ||W||_1 = 1, bounds
δ_2((1/ρ)Ŵ, W) by O_p of a sum of four terms: the oracle error ε^{(2)}_{≥κ}(W), two concentration errors,
and a term tail^{(2)}_ρ(W) which measures the difference between W and (1/ρ) min{1, ρW} in the L^2
norm; see Theorem 3.1 for the details.

The four error terms above arise for different reasons: first, when estimating the L^2 distance
between the matrix of probabilities Q_n and the estimator Ŵ, one encounters an oracle error for
Q_n and a concentration error, the latter being the second term in the above bound. Second, one
encounters an additional error when bounding the oracle error for Q_n in terms of the oracle error
for W. Since Q_n is random, this involves another concentration error, which is the third term in
the bound above. Finally, we need to estimate the δ_2 distance between W and (1/ρ)Q_n, which involves
both bounding the distance between W and (1/ρ) min{1, ρW}, and the distance between min{1, ρW}
and Q_n. It turns out that the latter error can be absorbed in the other terms present above, while
the former leads to the term tail^{(2)}_ρ(W).

Note that the second term in the oracle inequality is larger than the third term when
ρ < 1/log n. We have included both terms to handle the case in which ρ is large enough that the
latter term dominates, but the second term should be viewed as the primary term.

For general graphons, our results do not give explicit error bounds, since all we know is that
ε^{(2)}_{≥κ}(W) and tail^{(2)}_ρ(W) go to 0 as κ → 0 and ρ → 0. But in many applications, one has additional
information on the generating graphon, for example that it is actually a stochastic block model with
a fixed number of classes, in which case both ε^{(2)}_{≥κ}(W) and tail^{(2)}_ρ(W) become identically zero once κ
and ρ are small enough, leaving us only with the explicit terms in the above bound.

Another class of examples is α-Hölder-continuous graphons over ℝ^d equipped with a probability
measure that decays fast enough to make the function |x|^β integrable. This class encompasses
many models of latent position spaces used in practice. When W is α-Hölder-continuous and |x|^β is
integrable with α ∈ (0,1] and β > 2α, we prove that ε^{(2)}_{≥κ}(W) = O(κ^{α'}) and tail^{(2)}_ρ(W) = O(ρ^{β'}) for
some α', β' > 0, with α' = α/d and β' = ∞ in the simple case of the uniform distribution over a
box of the form [−R, R]^d. See Propositions 6.1 and 6.3 in Section 6 below.

This scaling behavior for the oracle error and tail bounds is typical. We have stated the oracle 
inequality in full generality, but when the graphon is sufficiently well behaved to estimate the 
oracle error and tail bounds, one can balance the error terms and derive the scaling rate for κ that
optimizes these bounds. For example, suppose the error bound consists of a power of κ plus
concentration terms whose denominators are κ^2 ρn and κn. Choosing κ proportional to a suitable
power of log(ρn)/(ρn) optimizes this bound (assuming nρ → ∞ as n → ∞) and yields an error bound
that is itself a power of log(ρn)/(ρn), which becomes

    O_p\left( \left( \frac{\log(\rho n)}{\rho n} \right)^{\alpha/(4\alpha + 2d)} \right)

for an α-Hölder-continuous graphon over [−R, R]^d equipped with the uniform distribution.

Least cut norm algorithm. To give an explicit description of the least cut norm algorithm, we
need the notion of the cut norm, first introduced in [36]. For an n × n matrix A, it is defined as

(1.4)   \|A\|_\square = \frac{1}{n^2} \max_{S, T \subseteq [n]} \Bigl| \sum_{(i,j) \in S \times T} A_{ij} \Bigr|.
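For small matrices the cut norm can be evaluated by brute force, which also shows why exact optimization is expensive. The sketch below assumes the 1/n² normalization of (1.4); the function name `cut_norm` is introduced here. It enumerates only the row set S, because for a fixed S the best column set T collects either all positive or all negative column sums.

```python
def cut_norm(A):
    """Brute-force cut norm of an n x n matrix, normalized by 1/n^2.

    Runs over all 2^n row subsets S; for each S the optimal T is either
    the columns with positive partial sums or those with negative ones.
    """
    n = len(A)
    best = 0.0
    for mask in range(1 << n):
        # Column sums restricted to the rows in S (encoded by the bitmask).
        col = [sum(A[i][j] for i in range(n) if (mask >> i) & 1)
               for j in range(n)]
        pos = sum(c for c in col if c > 0)
        neg = -sum(c for c in col if c < 0)
        best = max(best, pos, neg)
    return best / n ** 2

# Sign cancellations make the cut norm smaller than the L^1 norm.
signed = [[1, -1], [-1, 1]]
```

For the matrix `signed`, the L^1 norm in the paper's normalization is 1, while the cut norm is only 1/4; for matrices with non-negative entries the two coincide.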


One way to define the least cut norm algorithm would be to output a block model defined in terms
of the minimizer of ||A(G) − B^π||_□. But since we now need to minimize the cut norm rather than
an L^2 norm, this would involve yet another optimization problem to find the best matrix B for
each partition π. To circumvent this issue, we always obtain B by averaging. In other words, we
calculate

(1.5)   \hat\pi \in \operatorname*{argmin}_{\pi} \, \| A(G) - (A(G))_{\pi} \|_\square,

where the argmin is again over partitions π: [n] → [k] such that every non-empty partition class has
size at least ⌊κn⌋. The least cut norm algorithm then outputs the block average corresponding to π̂;
i.e., it outputs the block model Ŵ = (p̂, B̂) where p̂_i is again the relative size of the i-th partition
class of π̂ and B̂ = A(G)/π̂.

We will show that the least cut norm algorithm is consistent in the cut metric δ_□ on graphons,
defined similarly to δ_p, except that now we use the cut norm instead of the L^p norm || · ||_p; see (2.3)
below for the precise definition. More precisely, we will show that a.s., the error in the δ_□ distance
goes to zero for a W-random graph G if κ → 0 slowly enough as n → ∞. In addition
to consistency, we will again show a quantitative bound for an arbitrary normalized graphon W,
in the form of an O_p bound on the δ_□ distance with four error terms;
see Theorem 4.1 in Section 4 for the precise conditions on κ and the bound. The four error terms
have the same explanation as the error terms
for the least squares algorithm: the oracle error for W, a concentration error appearing when
estimating the cut norm error with respect to Q_n, a concentration error stemming from the random
nature of the oracle error for Q_n, and a tail bound stemming from the fact that for unbounded
graphons, the matrix Q_n generating G_n involves a truncation of the entries which are larger than
1. For Hölder-continuous graphons over ℝ^d we can again give explicit bounds of the form
O(κ^{α'}) for the oracle error and O(ρ^{β'}) for the tail term; see Propositions 6.1 and 6.3 in Section 6.

Degree sorting algorithm. The last algorithm we consider in this paper is the degree sorting
algorithm, which proceeds as follows. Given a graph G on n vertices with vertex degrees d_1, ..., d_n,
we sort the vertices by choosing a permutation σ of [n] such that

    d_{\sigma(1)} \le d_{\sigma(2)} \le \cdots \le d_{\sigma(n)}.

To separate the sorted vertices into k classes of nearly equal size, we choose integers 0 = n_0 < n_1 <
⋯ < n_k = n such that

    \left| n_i - \frac{i n}{k} \right| < 1,

and we define π: [n] → [k] by π(j) = i if n_{i−1} < σ(j) ≤ n_i. Thus, π groups the vertices into k
classes, sorted by degree. The output of the algorithm is the block model Ŵ = (p̂, B̂) with p̂_i = 1/k
and B̂ = A(G)/π. In other words, we simply cluster vertices with similar degrees and then average
over these clusters.
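The degree sorting steps translate directly into code. In this sketch the function names, the choice n_i = ⌊in/k⌋ for the cut points, and the non-decreasing sort order are assumptions made for illustration (reversing the order only relabels the classes); classes are indexed from 0 and k ≤ n is required.

```python
def block_average(A, pi, k):
    """A/pi: k x k matrix of block averages, 0 where a class is empty."""
    n = len(A)
    sizes = [pi.count(i) for i in range(k)]
    sums = [[0.0] * k for _ in range(k)]
    for u in range(n):
        for v in range(n):
            sums[pi[u]][pi[v]] += A[u][v]
    return [[sums[i][j] / (sizes[i] * sizes[j]) if sizes[i] and sizes[j] else 0.0
             for j in range(k)] for i in range(k)]

def degree_sort_block_model(A, k):
    """Cluster vertices by degree into k nearly equal classes, then average."""
    n = len(A)
    degrees = [sum(row) for row in A]
    # Vertices in non-decreasing order of degree (ties broken by index).
    order = sorted(range(n), key=lambda v: degrees[v])
    # Cut points n_i = floor(i * n / k), so |n_i - i n / k| < 1.
    cuts = [(i * n) // k for i in range(k + 1)]
    pi = [0] * n
    for i in range(k):
        for position in range(cuts[i], cuts[i + 1]):
            pi[order[position]] = i
    B = block_average(A, pi, k)
    p = [1.0 / k] * k
    return p, B, pi

# Small example with degrees 3, 2, 2, 1.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
p_hat, B_hat, pi_hat = degree_sort_block_model(A, k=2)
```

The cost is dominated by sorting and one pass over the adjacency matrix, which is what makes this algorithm polynomial, in contrast to the two optimization-based algorithms.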

This algorithm has the advantage of being very efficient, but it has no hope of working unless the 
degrees suffice to distinguish between the vertices. More precisely, we need the limiting distribution 
of normalized degrees to be atomless (i.e., there should not exist a nonzero fraction of the vertices 
with nearly the same degree). If G is a W-random graph, then we can express the limiting degree
distribution as n → ∞ in terms of W. We do so in Section 2.6. If the degree distribution of W is
atomless, then the degree sorting algorithm is consistent in the sense that δ_1(ρ^{-1}Ŵ, W) → 0 almost
surely, provided that the number k of classes tends to infinity in such a way that log k = o(nρ_n)
and k does not grow too quickly relative to n and ρ_n. See Theorem 5.1 for a precise statement.

Graphs with power-law degree distribution. As an example of random graphs which require
unbounded graphons, we consider two simple models for graphs with power-law degree distributions.
Both are generated by graphons over [0,1], with the first one given by W(x, y) = ½(g(x) + g(y)),
where g(x) = (1 − α)(1 − x)^{−α} for some α ∈ (0, 1), and the second one given by W(x, y) = g(x)g(y).
Both can be seen to have a degree distribution whose density function f(λ) decays like λ^{−(1+1/α)}, i.e., a
power-law degree distribution with exponent 1 + 1/α. Both graphons are in L^p as long as 1 ≤ p < 1/α.

It turns out that the first graphon can be expressed as an equivalent Hölder-continuous graphon
over ℝ^d equipped with a heavy-tailed distribution, while this is not possible for the second; see
Section 7 for details. But both fit into our general theory, implying consistency for all three
algorithms without any additional work, and both allow for explicit bounds similar to the ones
obtained for Hölder-continuous graphons, even though only one of them can actually be expressed
as a Hölder-continuous graphon. See Lemma 7.1 for the precise estimates.
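As a sanity check on the exponent and the integrability range, the product graphon can be worked out by hand. The sketch below assumes that the degree distribution of W is the law of W_x = ∫_0^1 W(x, y) dy for x uniform on [0,1] (the text defers the precise definition to Section 2.6); the tail probability notation is introduced here for illustration.

```latex
% Second graphon: W(x,y) = g(x) g(y), with g(x) = (1-\alpha)(1-x)^{-\alpha}.
\begin{align*}
\int_0^1 g(y)\,dy &= (1-\alpha)\int_0^1 u^{-\alpha}\,du = 1,
  \qquad\text{so } W_x = g(x)\int_0^1 g(y)\,dy = g(x), \\
\Pr[W_x > \lambda] &= \Pr\bigl[\,1-x < ((1-\alpha)/\lambda)^{1/\alpha}\bigr]
  = \Bigl(\frac{1-\alpha}{\lambda}\Bigr)^{1/\alpha}
  \qquad (\lambda \ge 1-\alpha), \\
f(\lambda) &= -\frac{d}{d\lambda}\Pr[W_x > \lambda]
  = \frac{(1-\alpha)^{1/\alpha}}{\alpha}\,\lambda^{-1-1/\alpha}, \\
\|W\|_p^p &= \Bigl(\int_0^1 g(x)^p\,dx\Bigr)^{2}
  = \Bigl((1-\alpha)^p \int_0^1 u^{-\alpha p}\,du\Bigr)^{2} < \infty
  \iff \alpha p < 1 .
\end{align*}
```

Reading the first graphon as ½(g(x) + g(y)), the same computation gives W_x = ½(g(x) + 1), which has the same tail exponent 1 + 1/α and the same integrability range p < 1/α.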

1.7. Comparison with related results. As discussed above, our primary contribution in this 
paper is to analyze the case of unbounded graphons, thus removing the restriction to networks in 
which all the degrees are of the same order. We also formulate our results over general probability 
spaces, which increases their applicability. (One can always pass to an equivalent graphon over [0,1], 
but standardizing the underlying space prevents taking advantage of any smoothness or regularity 
the graphon possesses, because these properties are not invariant under equivalence.) 

Least squares estimation is of course not a novel idea. Motivated by results of Choi and Wolfe 
[28] on estimating block models, Wolfe and Olhede proved consistency of least squares estimation 
for bounded graphons given sparse graphs (in an updated version of [66] that has not yet, as 
of this writing, been circulated publicly), under the additional hypotheses of Holder continuity 
and being bounded away from zero. Borgs, Chayes, and Smith [20] and Klopp, Tsybakov, and 
Verzelen [47] proved consistency for bounded graphons, again given sparse graphs, with no additional 
assumptions, but they did not handle the unbounded case. Our paper thus completes the analysis 
of this important algorithm, by extending it to the full range of graphons that describe sparse 
networks. 

Bounded graphons are automatically square-integrable, but that is not necessarily true for 
unbounded graphons. Least squares estimation is an appropriate technique only for L^2 graphons,
and we propose least cut norm estimation as a substitute that is applicable to arbitrary graphons.
Exact optimization is asymptotically inefficient for both the least squares and the least cut norm
algorithms. Thus, our consistency results should be viewed not as a proposal that exact optimization 
should be carried out in practice for large networks, but rather as a benchmark for approximate or 
heuristic optimization. 

Degree sorting has the advantage of efficiency, although it works only for graphons whose degrees 
are sufficiently well distributed. The idea of clustering vertices according to degree has a long history 
(see, for example, [33]), as well as connections with the theory of random graphs with a given degree 
sequence [24,49]. Degree sorting has recently been analyzed as a graphon estimation algorithm by 
Chan and Airoldi [22]. They showed that their sorting and smoothing algorithm is consistent for 
dense graphs under a two-sided Lipschitz condition on the degrees of the underlying graphon. Our
analysis accommodates sparse graphs and even unbounded graphons, while avoiding these Lipschitz 
conditions. 

1.8. Organization. The rest of the paper is organized as follows. Section 2 covers preliminary 
definitions and results, including equivalence and identifiability, random graph convergence, and 
degree distributions. Our three main algorithms are analyzed in Sections 3, 4, and 5. Section 6 
examines how our bounds behave given a greater degree of regularity than we assume elsewhere in 
the paper (namely, Hölder continuity). Finally, Section 7 analyzes two examples of graphons that
yield power-law degree distributions. 


2. Preliminaries 

2.1. Notation. As usual, we use [n] to denote the set {1, ..., n}. The density of an n × n matrix
H is defined as ρ(H) = (1/n^2) Σ_{i,j} H_ij; the density ρ(G) of a graph G is defined as the density of
its adjacency matrix.^3 We use λ to denote the standard Lebesgue measure on [0,1] (or, when we
do not expect this to create confusion, for the Lebesgue measure on [0,1]^2). We use Δ_k to denote
the simplex of probability measures on [k], i.e., Δ_k = {p = (p_i) ∈ [0,1]^k : Σ_i p_i = 1}. The notation
O_p means big-O in probability: if X and Y are random variables, then X = O_p(Y) means for each
ε > 0, there exists an M such that |X| ≤ M|Y| with probability at least 1 − ε.

Finally, we use the abbreviation a.s. for “almost surely” or “almost sure” and i.i.d. for “independent 
and identically distributed.” 

We will also consider general probability spaces $(\Omega, \mathcal{F}, \pi)$, where $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$ and $\pi$ is a
probability measure on $\Omega$ with respect to $\mathcal{F}$. As usual, a map $\phi: (\Omega, \mathcal{F}, \pi) \to (\Omega', \mathcal{F}', \pi')$ is called
measure preserving if for all $F' \in \mathcal{F}'$, $\phi^{-1}(F') \in \mathcal{F}$ and $\pi(\phi^{-1}(F')) = \pi'(F')$. We call such a map
an isomorphism if it is a bijection and its inverse is measure preserving as well, and an isomorphism
modulo 0 if, after removing sets of measure zero from $\Omega$ and $\Omega'$, it becomes an isomorphism between
the resulting probability spaces.

In addition to the distance $\delta_p$, we also consider the (in general larger) distance $\hat\delta_p(A, B)$ between
two $n \times n$ matrices $A$, $B$, defined as

(2.1)    $\hat\delta_p(A, B) = \min_{\sigma} \|A^\sigma - B\|_p,$

where the minimum is over all bijections $\sigma: [n] \to [n]$, the matrix $A^\sigma$ is defined by $(A^\sigma)_{ij} = A_{\sigma(i)\sigma(j)}$,
and the $L^p$ norm of an $n \times n$ matrix $A$ is defined by $\|A\|_p^p = \frac{1}{n^2} \sum_{i,j \in [n]} |A_{ij}|^p$. Note that by definition,
$\hat\delta_p(A, B)$ is a distance invariant under relabeling; i.e., it is a distance on equivalence classes of $n \times n$
matrices with respect to relabeling of the "vertices" in $[n]$. We will need a similar version of the cut
distance $\|A - B\|_\square$. It is defined as

(2.2)    $\hat\delta_\square(A, B) = \min_{\sigma} \|A^\sigma - B\|_\square,$

where the minimum is again over all bijections $\sigma: [n] \to [n]$ and $\|\cdot\|_\square$ is defined in (1.4).

* Note that the density of a simple graph is often defined as the number of non-zero entries in $A(G)$ divided by $\binom{n}{2}$;
this definition is related to ours by a multiplicative factor which becomes irrelevant as $n \to \infty$.

Note also that the $L^2$ norm is related to a scalar product $\langle \cdot, \cdot \rangle$ via $\|A\|_2^2 = \langle A, A \rangle$, with the scalar
product between two $n \times n$ matrices $A$, $B$ defined as

    $\langle A, B \rangle = \frac{1}{n^2} \sum_{i,j \in [n]} A_{ij} B_{ij}.$
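For very small matrices, the relabeling-invariant distance (2.1) can be evaluated by enumerating all bijections of $[n]$. The following Python sketch does this; the helper names `matrix_lp_norm` and `delta_hat_p` are illustrative and do not come from the paper, and the exhaustive search over $n!$ relabelings is only meant to make the definition concrete.

```python
import itertools
import numpy as np

def matrix_lp_norm(a, p=2.0):
    """Normalized L^p norm: (n^-2 * sum_ij |a_ij|^p)^(1/p)."""
    n = a.shape[0]
    return (np.sum(np.abs(a) ** p) / n**2) ** (1.0 / p)

def delta_hat_p(a, b, p=2.0):
    """Brute-force minimum of ||a^sigma - b||_p over all relabelings sigma."""
    n = a.shape[0]
    best = np.inf
    for sigma in itertools.permutations(range(n)):
        idx = np.array(sigma)
        a_sigma = a[np.ix_(idx, idx)]  # (a^sigma)_ij = a[sigma(i), sigma(j)]
        best = min(best, matrix_lp_norm(a_sigma - b, p))
    return best

# A path with center vertex 1, and the same path relabeled so that vertex 0 is the center.
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
relabeled = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)

print(matrix_lp_norm(path - relabeled))  # positive: the labelings differ
print(delta_hat_p(path, relabeled))      # zero: the graphs agree up to relabeling
```

The two matrices differ entrywise, but the minimum over relabelings vanishes, which illustrates why $\hat\delta_p$ is a distance on equivalence classes rather than on labeled matrices.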

2.2. Graphons and the cut metric. Given a probability space $(\Omega, \mathcal{F}, \pi)$, a measurable function
$W: \Omega \times \Omega \to \mathbb{R}$ is called symmetric if $W(x, y) = W(y, x)$ for all $x, y \in \Omega$. We call such a function a
graphon if it takes non-negative values and $\|W\|_1 < \infty$, where as usual, the $L^p$ norm of a function
$f: \Omega \times \Omega \to \mathbb{R}$ is defined by $\|f\|_p^p = \int_{\Omega \times \Omega} |f(x, y)|^p \, d\pi(x) \, d\pi(y)$. We call $W$ an $L^p$ graphon if
$\|W\|_p < \infty$, and we say that $W$ is normalized if $\|W\|_1 = 1$.

We will refer to $W$ as a graphon over $(\Omega, \mathcal{F}, \pi)$, or often just as a graphon over $\Omega$ when the
$\sigma$-algebra $\mathcal{F}$ and the probability measure $\pi$ are clear from the context. For example, when we say
that $W$ is a graphon over $[0,1]$, we mean that $W$ is a graphon over $[0,1]$ equipped with the Borel
$\sigma$-algebra and the uniform measure, unless stated otherwise. Note that graphs are special cases of
graphons: given a graph with vertex set $V$ and adjacency matrix $A$, we view it as a graphon on $V$
by equipping $V$ with the uniform distribution and choosing $W(u, v)$ to be $A_{uv}$.

In addition to the $L^p$ norm of a graphon $W$, we will also use the cut norm $\|W\|_\square$, defined as

    $\|W\|_\square = \sup_{S, T \subseteq \Omega} \left| \int_{S \times T} W(x, y) \, d\pi(x) \, d\pi(y) \right|,$

where the supremum is over measurable subsets of $\Omega$ (i.e., elements of $\mathcal{F}$). The corresponding metric
is defined for a pair of graphons $W$ and $W'$ on two probability spaces $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ by

(2.3)    $\delta_\square(W, W') = \inf_{\nu} \sup_{S, T \subseteq \Omega \times \Omega'} \left| \int_{S \times T} \bigl(W(x, y) - W'(x', y')\bigr) \, d\nu(x, x') \, d\nu(y, y') \right|,$

where the infimum is over couplings $\nu$ of the two measures $\pi$ and $\pi'$ and the supremum is over
measurable subsets of $\Omega \times \Omega'$. Because graphs are special cases of graphons, this in particular defines
a distance between a graph and an arbitrary graphon.

Remark 2.1. We will often consider graphons over $[0,1]$ (with the Borel $\sigma$-algebra unless otherwise
specified). For such graphons, both the cut distance $\delta_\square$ and the distance $\delta_p$ can be defined in a
simpler way. Specifically,

(2.4)    $\delta_p(W, W') = \inf_{\phi} \|W^\phi - W'\|_p$    and    $\delta_\square(W, W') = \inf_{\phi} \|W^\phi - W'\|_\square,$

where the infima are over isomorphisms $\phi$ from $[0,1]$ to itself. In fact, this simpler definition is
equivalent to the definitions (1.1) and (2.3) for many spaces used in practice, as long as they are
atomless; see Lemma A.1 in Appendix A for the precise setting. Lemma A.1 also shows that for
many spaces of interest, in particular both the unit interval $[0,1]$ with the uniform distribution and
any finite probability space, the infima in the expressions (1.1) and (2.3) are actually minima.

2.3. W-random graphs. Given a normalized graphon $W$ and a target density $\rho$, we define two
random graphs $Q_n(\rho W)$ and $G_n(\rho W)$ on $[n]$ as follows. First, we choose i.i.d. elements $x_1, \dots, x_n$
from the probability space $(\Omega, \mathcal{F}, \pi)$; these elements will index the vertices of the graphs. Let
$Q_n = Q_n(\rho W)$ be the $n \times n$ matrix whose $ij$ entry is equal to $\min\{1, \rho W(x_i, x_j)\}$ if $i \ne j$ and 0 if
$i = j$. We view $Q_n$ as a weighted graph on $n$ vertices, and we define a corresponding unweighted
graph $G_n$ by including the edge between vertices $i$ and $j$ with probability $(Q_n)_{ij}$ (independently for
each $i$ and $j$). We call $Q_n$ a weighted W-random graph at target density $\rho$, and $G_n$ a W-random
graph at target density $\rho$.
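The two-stage construction above (latent positions, then weights, then independent edges) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the probability space is taken to be $[0,1]$ with the uniform measure, each unordered pair is sampled once so that the graph is undirected, and the function names and the example graphon $W(x, y) = 4xy$ are hypothetical choices, not taken from the paper.

```python
import numpy as np

def sample_w_random_graph(w, n, rho, rng):
    """Return latent positions x, the weighted graph Q_n, and the graph G_n."""
    x = rng.random(n)  # i.i.d. latent positions, uniform on [0, 1] (assumption)
    q = np.minimum(1.0, rho * w(x[:, None], x[None, :]))
    np.fill_diagonal(q, 0.0)  # no self-loops: (Q_n)_ii = 0
    # One independent coin flip per unordered pair {i, j}, then symmetrize.
    upper = np.triu(rng.random((n, n)) < q, k=1)
    g = (upper | upper.T).astype(int)
    return x, q, g

def product_graphon(x, y):
    """Hypothetical normalized graphon W(x, y) = 4xy, so that ||W||_1 = 1."""
    return 4.0 * x * y

rng = np.random.default_rng(0)
x, q, g = sample_w_random_graph(product_graphon, n=200, rho=0.05, rng=rng)
print(q.mean(), g.mean())  # empirical densities rho(Q_n) and rho(G_n)
```

Both printed densities use the normalization $\rho(H) = n^{-2} \sum_{i,j} H_{ij}$ from Section 2.1 and should be close to the target density for moderately large $n$.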

In addition to the graph $G_n(\rho W)$ and the weighted graph $Q_n(\rho W)$, we will sometimes also
consider the weighted graph $H_n(W)$, defined as the weighted graph with entries $(H_n(W))_{ij} = W(x_i, x_j)$
for $i \ne j$, and $(H_n(W))_{ii} = 0$; in contrast to the definitions of $G_n(\rho W)$ and $Q_n(\rho W)$, which we will
only use for graphons, the latter notation will be used even if $W$ takes values in $\mathbb{R}$, rather than in
$[0, \infty)$.

Since $Q_n$ and $H_n$ are trivial for $n = 1$, we will always assume that $n \ge 2$ without explicitly
stating this.

Remark 2.2. The expected densities of the graphs $Q_n$ and $G_n$ are $\|\min\{1, \rho W\}\|_1$, which is $\rho$ when
$\rho W$ is bounded above by 1 and $W$ is normalized, and which is $(1 + o(1))\rho$ if $\rho = \rho_n \to 0$ as $n \to \infty$.
That is why we call $\rho$ the target density for $Q_n$ and $G_n$.

Note that many models of random graphs can be written as W-random graphs.

Example 2.3 (Stochastic block model on k blocks). Let $\Omega = [k]$, and let the probability distribution
$\pi$ on $\Omega$ be given by a vector $p = (p_1, \dots, p_k) \in \Delta_k$. Setting $W(i, j) = \beta_{ij}$ for some symmetric
matrix $B = (\beta_{ij})$ of non-negative numbers then describes the standard stochastic block model*
with parameters $(p, B)$. We denote the set of all block models on $k$ blocks by $\mathcal{B}_k$ and use $\mathcal{B}$ to denote
the union $\mathcal{B} = \bigcup_{k \ge 1} \mathcal{B}_k$. For $\kappa \in (0, 1/2]$, we use $\mathcal{B}_{\ge\kappa}$ to denote all block models $(p, B)$ such that
$p_i \ge \kappa$ for all $i$.

Alternatively, we can use the uniform distribution over the interval $[0,1]$ as our probability space.
Then we define $\widetilde{W}$ by first partitioning $[0,1]$ into $k$ adjacent intervals $I_1, \dots, I_k$ of lengths $p_1, \dots, p_k$, and then
setting $\widetilde{W}$ equal to $\beta_{ij}$ on $I_i \times I_j$. Note that the random graphs generated by $W$ and $\widetilde{W}$ are equal
in distribution. We denote the graphon $\widetilde{W}$ by $W[p, B]$, or by $W[B]$ if all the probabilities $p_i$ are
equal. (We will also sometimes abuse notation by identifying it with $W$, when this does not seem
likely to cause confusion.)
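The representation of a block model as a step function over $[0,1]$ can be sketched as follows. The helper `block_graphon` is a hypothetical name, the parameters `p` and `beta` are made-up values, and the intervals are taken to be half-open on the right (a choice that does not matter up to sets of measure zero).

```python
import numpy as np

def block_graphon(p, beta):
    """Step function over [0, 1] equal to beta[i, j] on I_i x I_j, with |I_i| = p[i]."""
    p = np.asarray(p, dtype=float)
    beta = np.asarray(beta, dtype=float)
    edges = np.cumsum(p)[:-1]  # right endpoints of I_1, ..., I_{k-1}

    def w(x, y):
        i = np.searchsorted(edges, x, side="right")  # index of the interval containing x
        j = np.searchsorted(edges, y, side="right")
        return beta[i, j]

    return w

p = [0.5, 0.3, 0.2]  # hypothetical block sizes
beta = np.array([[2.0, 0.5, 0.0],
                 [0.5, 1.0, 3.0],
                 [0.0, 3.0, 0.0]])  # hypothetical symmetric affinities
w = block_graphon(p, beta)

print(w(0.25, 0.6))  # x in I_1, y in I_2, so this is beta[0, 1]
print(w(np.array([0.25, 0.9]), np.array([0.6, 0.6])))  # vectorized evaluation
```

Sampling latent positions uniformly from $[0,1]$ and evaluating `w` gives the same distribution as sampling block labels from `p` and looking up `beta`, which is the sense in which the two representations agree.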

Example 2.4 (Mixed membership stochastic block model). To express the mixed membership
block model of [4] as a W-random graph, we define $\Omega$ to be the $k$-dimensional simplex $\Delta_k$ and
equip it with a Dirichlet distribution with some parameters $\alpha = (\alpha_1, \dots, \alpha_k)$. In other words, the
probability density at $(p_1, \dots, p_k)$ is proportional to

    $\prod_{i=1}^{k} p_i^{\alpha_i - 1}.$

Given a symmetric matrix $(\beta_{ij})$ of non-negative numbers, we then define

    $W(p, p') = \sum_{i,j} \beta_{ij} p_i p'_j.$

As in the stochastic block model, $\beta_{ij}$ describes the affinity between communities $i$ and $j$, but now
each vertex is assigned a probability distribution $p$ over the set of communities (rather than being
assigned a single community).
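As a small illustration of the mixed membership graphon, the sketch below draws membership vectors from a Dirichlet distribution and evaluates $W(p, p') = \sum_{i,j} \beta_{ij} p_i p'_j$ on all pairs. The parameter values and the helper name `mixed_membership_graphon` are assumptions made for the example only.

```python
import numpy as np

def mixed_membership_graphon(beta):
    """Return W(p, p') = sum_ij beta[i, j] * p[i] * p'[j] for membership vectors p, p'."""
    beta = np.asarray(beta, dtype=float)

    def w(p, p_prime):
        return p @ beta @ p_prime

    return w

rng = np.random.default_rng(1)
alpha = np.array([0.5, 0.5, 0.5])  # hypothetical Dirichlet parameters
beta = np.array([[3.0, 0.2, 0.2],
                 [0.2, 3.0, 0.2],
                 [0.2, 0.2, 3.0]])  # hypothetical community affinities
w = mixed_membership_graphon(beta)

memberships = rng.dirichlet(alpha, size=5)     # one latent position per vertex
affinity = memberships @ beta @ memberships.T  # W evaluated on all pairs of vertices
print(affinity.round(3))
```

When a membership vector is a unit vector, `w` reduces to a single entry of `beta`, which recovers the ordinary stochastic block model as a special case.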

2.4. Equivalence and identifiability. In this section, we determine when two different graphons 
lead to sequences of random graphs that are indistinguishable, in the sense that they are equal in 
distribution. As we will see, this is the case if and only if the two graphons are equivalent according 
to the following definition. 

Definition 2.5. Let $W$ and $W'$ be graphons over $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$. We call $W$ and $W'$
equivalent† if there exist two measure-preserving maps $\phi$ and $\phi'$ from $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ to
a third probability space $(\Omega'', \mathcal{F}'', \pi'')$ and a graphon $U$ on $(\Omega'', \mathcal{F}'', \pi'')$, such that $W = U^{\phi}$ and
$W' = U^{\phi'}$ almost everywhere. We call $W$ and $W'$ isomorphic modulo 0 if there exists a map
$\phi: \Omega \to \Omega'$ such that $\phi$ is an isomorphism modulo 0 and $W = (W')^{\phi}$ almost everywhere.

* We will not restrict the entries to be bounded by 1, since we want to consider normalized graphons, which become
trivial if all entries are bounded by 1.

† Our notion of equivalence is closely related to the notion of "weak isomorphism" from [14], the only difference
being that in [14] the maps $\phi$ and $\phi'$ were required to be measure preserving with respect to the completion of the
spaces $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$. We will not use the term weak isomorphism since we want to avoid the impression
that it implies that the underlying probability spaces are isomorphic after removing suitable sets of measure 0. It does
not; see Example 2.7 for two equivalent graphons on non-isomorphic probability spaces.

Theorem 2.6. Let $W$ and $W'$ be graphons over $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$, respectively. Assume
that $n\rho_n \to \infty$ and that either $\max\{\|W\|_\infty, \|W'\|_\infty\} \le 1$ or $\rho_n \to 0$. Then the sequences
$(G_n(\rho_n W))_{n \ge 0}$ and $(G_n(\rho_n W'))_{n \ge 0}$ are identically distributed if and only if $W$ and $W'$ are equivalent.

For the dense case and bounded graphons, this follows from the results of [14] and [18] (or from
those of [32], provided we only consider graphons over $[0,1]$). The result for the sparse case and for general
(possibly unbounded) graphons is new, and relies on the theory of graph convergence for graphons. We
prove it in the next section.

The two representations $W$ and $\widetilde{W}$ for the stochastic block model from Example 2.3 are clearly
equivalent in the sense of Definition 2.5. In this case, we actually have $\widetilde{W} = W^{\phi}$ for a measure-preserving
map $\phi: [0,1] \to [k]$ (namely the map which maps all points in the interval $I_i$ to $i \in [k]$).
But in general, equivalence does not imply the existence of a measure-preserving map $\phi$ such that
$W = (W')^{\phi}$ or $W^{\phi} = W'$. This is the content of the next example.

Example 2.7. Let $\Omega = [4]$ and $\Omega' = [6]$, both equipped with the uniform distribution. Define $W$
and $W'$ to be zero if both arguments are even or both arguments are odd, and set both of them to
a constant $p$ otherwise. It is easy to see that they are equivalent: indeed, let $\Omega'' = \{1, 2\}$ and define
$\phi: [4] \to [2]$ and $\phi': [6] \to [2]$ by mapping even elements to 2 and odd elements to 1. Setting $U$ to
$p$ if its two arguments are different and to 0 otherwise, we see that $W = U^{\phi}$ and $W' = U^{\phi'}$. This
shows that in general, we cannot restrict ourselves to a single, measure-preserving map $\phi: \Omega \to \Omega'$,
since there is simply no measure-preserving map between $\Omega$ and $\Omega'$.
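The claims in this example are finite and can be checked exhaustively. The sketch below assumes the pull-back convention $U^{\phi}(x, y) = U(\phi(x), \phi(y))$ and the uniform measure on all three spaces; the function names and the value chosen for the constant $p$ are illustrative.

```python
import itertools
import numpy as np

def parity_graphon(n, p):
    """W on [n] x [n]: 0 if both arguments have the same parity, p otherwise."""
    labels = np.arange(1, n + 1)
    return p * ((labels[:, None] + labels[None, :]) % 2)

def pull_back(u, phi):
    """U^phi(x, y) = U(phi(x), phi(y)), with phi given as an array of 0-based indices."""
    return u[np.ix_(phi, phi)]

def exists_measure_preserving_map(n_src, n_dst):
    """Search all maps [n_src] -> [n_dst] between uniform spaces for a measure-preserving one."""
    for phi in itertools.product(range(n_dst), repeat=n_src):
        counts = np.bincount(phi, minlength=n_dst)
        if np.all(counts * n_dst == n_src):  # each preimage has mass 1 / n_dst
            return True
    return False

p = 0.7                                        # hypothetical value of the constant
u = p * (1 - np.eye(2))                        # U on {1, 2}: p off the diagonal
phi4 = (np.arange(1, 5) % 2 == 0).astype(int)  # odd -> index 0, even -> index 1
phi6 = (np.arange(1, 7) % 2 == 0).astype(int)

print(np.allclose(pull_back(u, phi4), parity_graphon(4, p)))  # W  = U^phi
print(np.allclose(pull_back(u, phi6), parity_graphon(6, p)))  # W' = U^phi'
print(exists_measure_preserving_map(4, 6), exists_measure_preserving_map(6, 4))
```

The last line confirms by enumeration that no measure-preserving map exists in either direction between $[4]$ and $[6]$, even though both graphons are pull-backs of the same $U$.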

But even if both probability spaces are $[0,1]$ equipped with the uniform measure (in which case
there are many measure-preserving maps between the two), we can in general not find a measure-preserving
map $\phi$ such that $W = (W')^{\phi}$ or the other way around. To see this, let $\phi_k(x) = kx \bmod 1$,
define $W_1(x, y) = xy$, and let $W_k = W_1^{\phi_k}$. Then there is no measure-preserving transformation
$\phi: [0,1] \to [0,1]$ such that $W_2^{\phi} = W_3$ or $W_3^{\phi} = W_2$; see Example 8.2 in [44] for the proof.

There is, however, a special case where it is possible to use just a single map, namely the case
where $W$ and $W'$ are twin-free Borel graphons. Here a graphon is called a Borel graphon if the
underlying probability space is a Borel space, i.e., a space that is isomorphic to a Borel subset of a
complete separable metric space equipped with an arbitrary probability measure with respect to the
Borel $\sigma$-algebra. A graphon $W$ is called twin-free if the set of twins of $W$ has measure zero, where a
twin is a point $x$ in the underlying probability space for which there is another point $y$ such that
$W(x, \cdot)$ is equal to $W(y, \cdot)$ almost everywhere. Note that in Example 2.7 above, the graphons $U$
and $W_1$ are twin-free, while $W$, $W'$, and $W_k$ for $k \ge 2$ are not.

Theorem 2.8. Let $W$ and $W'$ be twin-free Borel graphons. Then $W$ and $W'$ are equivalent if and
only if they are isomorphic modulo 0.

The theorem can easily be deduced from the results of [14], and is proved in Appendix A. 

To state our next theorem, we define a standard Borel graphon* as a graphon over a probability
space that is the disjoint union of an interval $[0, p]$ equipped with the uniform distribution and the
usual Borel $\sigma$-algebra, plus a countable number of isolated points $\{x_j\}_{j \in J}$ with non-zero mass $p_j$ for
each of them, allowing for the special cases where either the set of atoms or the interval $[0, p]$ is
absent. The former is the case of graphons over $[0,1]$, while the latter is the case of block models
over $[k]$ equipped with a probability measure in $\Delta_k$.

* Note that some authors use the notion of standard graphons or standard kernels for graphons with values in $[0, 1]$;
here we don't require such a condition.

Theorem 2.9. Let $W$ be a graphon over an arbitrary probability space $(\Omega, \mathcal{F}, \pi)$.

(i) There exists an equivalent graphon over $[0,1]$ equipped with the uniform distribution.

(ii) There exists a twin-free standard Borel graphon $U$ and a measure-preserving map $\phi$ from
$(\Omega, \mathcal{F}, \pi)$ to the space on which $U$ is defined such that $W = U^{\phi}$ almost everywhere, showing in
particular that $W$ is equivalent to a twin-free standard Borel graphon.

Again, the theorem follows easily from the results of [14]; see Appendix A.

Remark 2.10. (i) The above theorem states that for any graphon W, we can find both an equivalent 
graphon U over [0,1] and an equivalent twin-free standard Borel graphon U. But in general, it is 
not possible to find a single equivalent graphon U which is both twin-free and a graphon over [0,1], 
as the example of a block model shows, since any representation of it over [0,1] has uncountably 
many twins. 

(ii) As claimed in the introduction, the metric (1.1) is indeed a distance on equivalence classes;
in other words, $\delta_p(W, W') = 0$ if $W$ and $W'$ are equivalent. To see this, let $W$, $W'$, $\phi$, $\phi'$ and $U$
be as in Definition 2.5. Define a coupling $\nu$ of $\pi$ and $\pi''$ by choosing $x \in \Omega$ according to
$\pi$ and then setting $x'' = \phi(x)$. Using this coupling, it is easy to see that $\delta_p(U, W) = 0$. Similarly,
$\delta_p(U, W') = 0$, which together with the triangle inequality proves the claim.

(iii) When comparing finite graphs to graphons over $[0,1]$, we will sometimes use a stronger
version of the $\delta_p$ distance. This distance extends the definition (2.1) to a distance between an $n \times n$
matrix $A$ and a graphon $W$ over $[0,1]$, defined by

(2.5)    $\hat\delta_p(A, W) = \min_{\sigma} \|W[A^\sigma] - W\|_p,$

where, again, the minimum is over all bijections $\sigma: [n] \to [n]$ and $(A^\sigma)_{ij} = A_{\sigma(i)\sigma(j)}$, and where $W[\cdot]$
is defined in Example 2.3.

We close this section with a theorem giving a different characterization of equivalence in terms of
the metrics $\delta_p$ and $\delta_\square$.

Theorem 2.11. Let $p \ge 1$, and let $W$ and $W'$ be $L^p$ graphons over two arbitrary probability spaces.
Then the following statements are equivalent:

(i) $\delta_\square(W, W') = 0$;

(ii) $\delta_p(W, W') = 0$;

(iii) $W$ and $W'$ are equivalent.

The theorem follows again from the results of [14], even though the details are a little more 
involved than for the previous two theorems and in particular make use of the fact that the infimum 
in (2.3) is actually a minimum if the underlying space is the unit interval; see Appendix A for the 
proof. 

2.5. Relation to graph convergence. As mentioned before, W-random graphs arise very naturally
as non-parametric models when considering a given graph as a finite subgraph of an infinite,
exchangeable array, at least in the dense setting. Indeed, as the works of Hoover [43] and Aldous [6]
show, any graph which is an induced subgraph of an infinite, exchangeable array can be modelled
as a W-random graph* for some graphon $W$.

* Strictly speaking, the results of [6,43] only imply that the extremal components of an infinite, exchangeable random
graph are given by a graphon; see [32] for a review of this connection. But if we are given only one sample, the
difference between an exchangeable random graph and an ergodic component is unobservable, since by the results of
[62], a single observation of an exchangeable random graph only reveals one of the ergodic components, just like a
single observation of an infinite set of coin flips from an exchangeable sequence looks like a sequence of independent
coin flips.

A different window into the theory of W-random graphs is given by the theory of graph convergence.
Here one asks when a sequence of graphs $G_n$ should be considered convergent. Motivated by extremal
combinatorics, one way to address this question is to define a sequence of graphs to be convergent if
the number of subgraphs isomorphic to a given graph $H$ converges for every finite graph $H$, once
suitably normalized. It turns out that in the dense setting, this notion is equivalent to many other
natural notions of graph convergence that are relevant in computer science, statistical physics, and
other fields [17-19].

One of these equivalent notions is convergence in metric, defined in terms of the cut metric (2.3).
We say that a sequence of dense graphs converges to a graphon $W$ in metric if $\delta_\square(G_n, W) \to 0$ as
$n \to \infty$. Note that the limit $W$ is not unique, since two graphons $W$ and $W'$ which are equivalent
have distance $\delta_\square(W, W') \le \delta_1(W, W') = 0$. The results of [14] imply that this is the only ambiguity:
if $W$ and $W'$ are such that $\delta_\square(G_n, W) \to 0$ and $\delta_\square(G_n, W') \to 0$, then $W$ and $W'$ are equivalent.

Given this notion of convergence, one may want to ask whether all sequences of graphs $G_n$ have
a limit, or whether they at least have a subsequence which converges in the metric $\delta_\square$. For dense
graphs, the answer to this question is yes and was given in [53], where it was shown that every
sequence of dense graphs has a subsequence that is a Cauchy sequence in the metric $\delta_\square$, and that
every Cauchy sequence converges to a graphon $W$ over $[0,1]$.

Thus the results of [53] completely parallel the results on exchangeable arrays of [6,43]: given 
an ergodic component of an infinite, exchangeable graph, one can find a graphon over [0,1] that 
generates this array, and given an arbitrary sequence of (random or non-random) dense graphs, 
one can find a subsequence and a graphon over [0,1] such that the subsequence converges to that 
graphon. In both cases, the graphon is identifiable only up to equivalence. Finally, combining [53] 
with [14], we know that if the sequence of graphs happens to be a sequence of W-random graphs,
then it converges a.s., and the generating graphon is a representative from the equivalence class of 
limits. 

The net result of this theory is that a convergent sequence of dense networks behaves like a
sequence of W-random graphs for some graphon $W$ and can thus be viewed as W-quasi-random
graphs. Having established this connection between W-random graphs and W-quasi-random graphs
in the dense setting, one might ask whether it can be extended to a convergence theory for sparse
graph sequences. It is clear that we cannot just simply consider Cauchy sequences in the cut metric
$\delta_\square$, since all sequences of sparse graphs have this property. Indeed, by the triangle inequality

    $\delta_\square(G_n, G_m) \le \delta_1(G_n, G_m) \le \delta_1(G_n, 0) + \delta_1(G_m, 0) = \rho(G_n) + \rho(G_m),$

and the right side tends to zero for sparse sequences, for which $\rho(G_n) \to 0$.
But if instead of the graphon given by the adjacency matrix of $G_n$ we consider the normalized
adjacency matrix $\frac{1}{\rho(G_n)} A(G_n)$, this argument no longer holds.

This motivates the following definition. To state it, we define, for an arbitrary graph $G$ with
adjacency matrix $A(G)$ and a constant $c \in \mathbb{R}$, the graph $cG$ to be the weighted graph with adjacency
matrix $cA(G)$.

Definition 2.12. Let $W$ be a graphon over an arbitrary probability space. A sequence of graphs
$G_n$ converges to $W$ in metric if

    $\delta_\square\left(\frac{1}{\rho(G_n)} G_n, W\right) \to 0$    as $n \to \infty$.

In this case, we call $G_n$ a W-quasi-random sequence with target density $\rho \|W\|_1$.

Remark 2.13. This definition is an extension of the one given in [15] for graphons $W$ over $[0,1]$.
There, as in the earlier literature on graph convergence for dense graphs, the distance between
a graph $G$ and a graphon $W$ was defined as the distance between $W$ and the embedding $W[G]$
of $G$ into the space of graphons over $[0,1]$, i.e., $\delta_\square(G, W) = \delta_\square(W[G], W)$, with $W[G]$ defined as
in Example 2.3, by setting $W[G]$ to $A_{ij}(G)$ on $I_i \times I_j$, where $I_1, \dots, I_n$ is a partition of $[0,1]$ into
adjacent intervals of lengths $1/n$. In our setting, this embedding is not needed, since the cut distance
(2.3) is defined on equivalence classes of graphons, and $G$ and its embedding $W[G]$ are equivalent.

Given the above definition of convergence for sparse graphs, one might ask whether this notion is
again equivalent to other notions of convergence, and whether sparse W-random graphs converge
again to the generating graphon. The answer to both questions is yes, with one exception: convergence
of subgraph counts is no longer equivalent to convergence in metric.* But all other notions
of convergence proved to be equivalent for dense graphs in [19] remain equivalent in the sparse
setting, as shown in [16]. It is also again true that a sequence of W-random graphs converges to the
generating graphon. This is the content of the following theorem.

Theorem 2.14. Let $G_n = G_n(\rho_n W)$ where $W$ is a normalized graphon over an arbitrary probability
space, and $\rho_n$ is such that $n\rho_n \to \infty$ and either $\limsup \rho_n \|W\|_\infty \le 1$ or $\rho_n \to 0$. Then a.s.
$\rho(G_n)/\rho_n \to 1$ and

    $\delta_\square\left(\frac{1}{\rho(G_n)} G_n, W\right) \to 0.$

Proof. By Theorem 2.9, we can find a graphon $W'$ over $[0,1]$ that is equivalent to $W$. Since
equivalent graphons lead to identically distributed random graphs, it is enough to prove the theorem
for graphons over $[0,1]$. But for this case, it has been established in [15]. □

Remark 2.15. The above theorem has many interesting consequences for graphon estimation. In
particular, assume that an algorithm releases an estimator $\hat{W}$ for the generating graphon $W$ which
is close in $\delta_p$ for some $p \ge 1$. These distances dominate the invariant $L^1$ distance $\delta_1$, which in turn
dominates the cut distance $\delta_\square$. Combined with the results from [16] which state that many other
notions of convergence are equivalent to convergence in metric (see Theorem 2.10), we obtain that
consistent approximation for $W$ leads to consistent approximations for various quantities of interest,
such as minimal energies of graphical models defined on $G_n$ (see Proposition 5.12 in [16], which
actually gives a quantitative bound in terms of the cut distance) or collections of cuts in $G_n$ (see
Lemma 5.11 in [16], which again gives a quantitative bound). By Theorem 2.16 below, we also get
good approximations for the empirical distributions of the degrees of $G_n$.

Combined with Theorem 2.11, Theorem 2.14 immediately implies Theorem 2.6.

Proof of Theorem 2.6. Let $G_n = G_n(\rho_n W)$ and $G_n' = G_n(\rho_n W')$. Since $\delta_\square(\frac{1}{\rho(G_n)} G_n, W) \to 0$ and
$\delta_\square(\frac{1}{\rho(G_n')} G_n', W') \to 0$ by Theorem 2.14, we have $\delta_\square(W, W') = 0$ if $G_n$ and $G_n'$ are identically distributed.
Since, on the other hand, $G_n$ and $G_n'$ are clearly identically distributed if $W$ and $W'$ are equivalent,
Theorem 2.6 follows from Theorem 2.11. □


2.6. Convergence of degree distributions. In this subsection we show that convergence in the
cut metric $\delta_\square$ implies convergence of the empirical degree distributions. We define the normalized
degree of a vertex $x \in V(G)$ as $d_x / \bar{d}$, where $d_x$ is its degree and $\bar{d}$ is the average degree

    $\bar{d} = \frac{1}{|V(G)|} \sum_{x \in V(G)} d_x = \frac{2|E(G)|}{|V(G)|}.$

The normalized degree distribution of $G$ is the empirical distribution of the normalized degrees, with
cumulative distribution function

    $D_G(\lambda) = \frac{1}{|V(G)|} \sum_{x \in V(G)} 1_{d_x \le \lambda \bar{d}}.$


* Indeed, it is possible to modify a sparse graph sequence by very little while greatly changing its subgraph counts:
a W-random graph with sufficiently low target density will have far fewer triangles than edges, so one can eliminate
triangles without making any substantial change in the cut metric. For details, see the discussion after Theorem 2.18
in [15].

In a similar way, we define the degrees of a normalized graphon $W: \Omega \times \Omega \to [0, \infty)$ as the random
variable

    $W_x = \int_{\Omega} W(x, y) \, d\pi(y),$

where $x$ is chosen according to the probability measure $\pi$ on $\Omega$. This random variable has cumulative
distribution function

    $D_W(\lambda) = \pi(\{x : W_x \le \lambda\}).$

Recalling that convergence in distribution can be formulated as convergence in the Lévy-Prokhorov
distance, we say that the normalized degree distributions of a sequence $G_n$ of graphs converge to
the degree distribution of $W$ if $d_{LP}(D_{G_n}, D_W) \to 0$, where as usual, the Lévy-Prokhorov distance
$d_{LP}$ between two distribution functions $D$ and $D'$ is defined by

    $d_{LP}(D, D') = \inf\{\varepsilon > 0 : D'(\lambda - \varepsilon) - \varepsilon \le D(\lambda) \le D'(\lambda + \varepsilon) + \varepsilon \text{ for all } \lambda \in \mathbb{R}\}.$
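For a finite graph, the normalized degrees and the empirical distribution function $D_G$ defined earlier in this subsection are straightforward to compute from the adjacency matrix. The following sketch uses hypothetical helper names and a small star graph as the example; it is meant only to make the normalization by the average degree concrete.

```python
import numpy as np

def normalized_degrees(adj):
    """Degrees d_x divided by the average degree of the graph."""
    deg = adj.sum(axis=1)
    return deg / deg.mean()

def degree_cdf(adj, lam):
    """D_G(lam): fraction of vertices whose normalized degree is at most lam."""
    return np.mean(normalized_degrees(adj) <= lam)

# Star on 4 vertices: the center has degree 3, the leaves have degree 1.
star = np.zeros((4, 4), dtype=int)
star[0, 1:] = 1
star[1:, 0] = 1

print(normalized_degrees(star))  # average degree is 1.5
print(degree_cdf(star, 0.5), degree_cdf(star, 1.0), degree_cdf(star, 2.0))
```

Because the degrees are divided by the average degree, the normalized degrees always have mean 1, which matches the normalization $\|W\|_1 = 1$ on the graphon side.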

Our next theorem shows that convergence in the cut metric implies convergence of the normalized
degree distributions. Combined with Theorem 2.14, this gives that a.s., the normalized degree
distributions of a sequence of W-random graphs converge to the degree distribution of $W$ as long as
$n\rho_n \to \infty$ and $\rho_n \to 0$. Indeed, observing that for any graph $G$, the normalized degree distribution
$D_G$ is equal to the degree distribution of $\frac{1}{\rho(G)} A(G)$ considered as a graphon over $V(G)$ equipped
with the uniform distribution, both statements follow immediately from the following theorem.

Theorem 2.16. Let $U$ and $W$ be two normalized graphons. Then

    $d_{LP}(D_U, D_W) \le \sqrt{2 \delta_\square(U, W)}.$

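The proof will make use of the following lemma.

Lemma 2.17. Let $U$ and $W$ be two normalized graphons over the same probability space $\Omega$. If $x$ is
chosen at random from $\Omega$, then

    $\Pr(|W_x - U_x| \ge \varepsilon) \le \frac{2}{\varepsilon} \|U - W\|_\square.$

Proof. We have

    $\Pr(|W_x - U_x| \ge \varepsilon) \le \frac{1}{\varepsilon} E[|W_x - U_x|]$
    $\qquad = \frac{1}{\varepsilon} E[(W_x - U_x) 1_{W_x \ge U_x}] + \frac{1}{\varepsilon} E[(U_x - W_x) 1_{W_x < U_x}].$

Defining $S$ as the set of points $x \in \Omega$ such that $W_x \ge U_x$ and $S^c$ as the set of points $x \in \Omega$ such that
$W_x < U_x$, we write the right side as

    $\frac{1}{\varepsilon} \int_{\Omega \times S} (W - U) + \frac{1}{\varepsilon} \int_{\Omega \times S^c} (U - W) \le \frac{2}{\varepsilon} \|U - W\|_\square,$

as desired. □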


Proof of Theorem 2.16. To prove the theorem, we will prove that for two arbitrary graphons $U$ and $W$ and
all $\lambda \in \mathbb{R}$ and $\varepsilon > 0$,

(2.6)    $D_W(\lambda) \le D_U(\lambda + \varepsilon) + \frac{2}{\varepsilon} \delta_\square(U, W).$

Because the degree distributions of equivalent graphons are identical, it will be enough to prove
(2.6) for two graphons over $[0,1]$, with $\|U - W\|_\square$ in place of $\delta_\square(U, W)$ in the upper bound.

To this end, we estimate the probability that $U_x$ and $W_x$ differ by at least $\varepsilon$ by Lemma 2.17. As
a consequence,

    $D_W(\lambda) = \Pr[W_x \le \lambda]$
    $\qquad \le \Pr[U_x \le \lambda + \varepsilon] + \Pr[|U_x - W_x| \ge \varepsilon]$
    $\qquad \le D_U(\lambda + \varepsilon) + \frac{2}{\varepsilon} \|U - W\|_\square,$

which proves (2.6) and hence the theorem. □
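The passage from (2.6) to the bound of Theorem 2.16 is left implicit. One way to fill in the step, sketched here under the assumption that $\delta_\square(U, W) > 0$ and using that (2.6) is symmetric in $U$ and $W$, is to choose $\varepsilon$ so that the error term in (2.6) equals $\varepsilon$:

```latex
\begin{align*}
\varepsilon &:= \sqrt{2\,\delta_\square(U,W)}
  \quad\Longrightarrow\quad
  \frac{2}{\varepsilon}\,\delta_\square(U,W) = \varepsilon, \\
D_W(\lambda) &\le D_U(\lambda+\varepsilon) + \varepsilon
  && \text{by (2.6)}, \\
D_U(\lambda-\varepsilon) &\le D_W(\lambda) + \varepsilon
  && \text{by (2.6) with $U$ and $W$ exchanged, applied at $\lambda-\varepsilon$}.
\end{align*}
```

Together these give $D_U(\lambda - \varepsilon) - \varepsilon \le D_W(\lambda) \le D_U(\lambda + \varepsilon) + \varepsilon$ for all $\lambda$, so this value of $\varepsilon$ is admissible in the definition of the Lévy-Prokhorov distance and $d_{LP}(D_U, D_W) \le \sqrt{2\delta_\square(U, W)}$. If $\delta_\square(U, W) = 0$, then (2.6) holds with no error term for every $\varepsilon > 0$, and the distance is zero.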

2.7. Existence of approximating block models. Having seen that block models are special
cases of W-random graphs, one might wonder how well an arbitrary graphon can be approximated
by a stochastic block model. The answer is given by the following lemma. To state it, we
recall the definition of $\mathcal{B}_{\ge\kappa}$ as the set of all block models with minimal block size at least $\kappa$,
$\mathcal{B}_{\ge\kappa} = \{(p, B) \in \mathcal{B} : \min_i p_i \ge \kappa\}$.

Lemma 2.18. Let $1 \le p < \infty$, let $W$ be an $L^p$ graphon, and let $\varepsilon^{(p)}_{\ge\kappa}(W)$ be as in (1.2). Then the
infimum in (1.2) is achieved for some $W' \in \mathcal{B}_{\ge\kappa}$ that has norm $\|W'\|_p \le 2\|W\|_p$. Furthermore,
$\varepsilon^{(p)}_{\ge\kappa}(W) \to 0$ as $\kappa \to 0$.

Proof. We clearly have $\varepsilon^{(p)}_{\ge\kappa}(W) = \inf_{W'} \delta_p(W, W') \le \|W\|_p$ (take $W' = 0$), so by the triangle inequality, we only
need to consider block models $W'$ with $\|W'\|_p \le 2\|W\|_p$. Again by the triangle inequality, the
distance $\delta_p(W, W')$ is continuous in $W'$, which implies that the infimum is actually a minimum.

To see that $\varepsilon^{(p)}_{\ge\kappa}(W) \to 0$ as $\kappa \to 0$, we first replace $W$ by an equivalent graphon $U$ over $[0,1]$, and
then use the approximation $U_n$ to $U$ given by averaging over the partition consisting of consecutive
intervals of length $1/n$. This approximation is a block model with minimal block size $1/n$, and it
converges to $U$ by the Lebesgue differentiation theorem and a truncation argument (see Lemma 5.6
in [16]). □
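The averaging construction used in the proof can be illustrated numerically. The sketch below approximates the block averages of a graphon over $[0,1]$ by a midpoint rule and reports the resulting $L^2$ error; the function name `block_average`, the grid resolution, and the example graphon $U(x, y) = 4xy$ are all assumptions made for the illustration, and the error shown is a numerical approximation rather than an exact norm.

```python
import numpy as np

def block_average(w, n, grid=20):
    """Average w over the n x n grid of squares I_i x I_j with |I_i| = 1/n.

    Uses a midpoint rule with `grid` sample points per interval and returns the
    n x n matrix of block averages together with an approximate L^2 error.
    """
    offsets = (np.arange(grid) + 0.5) / (grid * n)
    pts = (np.arange(n)[:, None] / n + offsets[None, :]).ravel()  # n * grid sample points
    vals = w(pts[:, None], pts[None, :])
    blocks = vals.reshape(n, grid, n, grid).mean(axis=(1, 3))
    approx = np.repeat(np.repeat(blocks, grid, axis=0), grid, axis=1)
    l2_error = np.sqrt(np.mean((vals - approx) ** 2))
    return blocks, l2_error

def product_graphon(x, y):
    """Hypothetical normalized graphon U(x, y) = 4xy."""
    return 4.0 * x * y

for n in (4, 16, 64):
    blocks, err = block_average(product_graphon, n)
    print(n, round(blocks.mean(), 6), round(err, 4))
```

For this smooth example the averages preserve the total mass and the error shrinks as $n$ grows; for unbounded graphons the proof needs the additional truncation argument cited above.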

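When applying the lemma, we will sometimes be constrained to use only block models whose
block sizes are all a multiple of $1/n$, i.e., block models in

    $\mathcal{B}_{n, \ge\kappa} = \{(p, B) \in \mathcal{B} : \text{for all } i,\ p_i n \in \mathbb{Z} \text{ and } p_i n \ge \lfloor \kappa n \rfloor\}.$

Note that $\mathcal{B}_{n, \ge\kappa}$ naturally corresponds to the set $\mathcal{A}_{n, \ge\kappa}$ of $n \times n$ block matrices $A$ such that each
block in $A$ has size at least $\lfloor \kappa n \rfloor$, via

(2.7)    $\{W[A] : A \in \mathcal{A}_{n, \ge\kappa}\} = \{W[p, B] : (p, B) \in \mathcal{B}_{n, \ge\kappa}\}.$

Our next lemma shows that every block model in $\mathcal{B}_{\ge\kappa}$ can be well approximated by a block model
in $\mathcal{B}_{n, \ge\kappa}$, and it also shows that $\varepsilon^{(p)}_{\ge\kappa}(W)$ can be bounded from above in terms of a minimum over $\mathcal{B}_{n, \ge\kappa}$.
It is proved in Appendix C.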

Lemma 2.19. Let $\kappa \in (0, 1]$. Then there exists a constant $n_0(\kappa)$ such that for all $p \ge 1$ and all $L^p$
graphons $W$, the following holds:

If $W' \in \mathcal{B}_{\ge\kappa}$ is a block model on $[k]$, then the labels in $[k]$ can be reordered in such a way that for
each $n \ge 1/\kappa$ there exists a block model $W'' \in \mathcal{A}_{n, \ge\kappa}$ with

2.8. Convergence of W-weighted graphs. Recall that by Theorem 2.14, the sequence $G_n =
G_n(\rho_n W)$ converges to $W$ in the cut metric if $W \in L^p$ is normalized, $n\rho_n \to \infty$, and either
$\limsup \rho_n \|W\|_\infty \le 1$ or $\rho_n \to 0$. Our next lemma, which is a slight strengthening of Theorem 2.14(a)
from [15], states that for the weighted graphs $Q_n(\rho_n W)$, the same holds in the tighter distance
$\delta_p$. Recalling that for any graphon, we can find an equivalent graphon over $[0,1]$, we will restrict
ourselves to the case where $W$ is a graphon over $[0,1]$, in which case we can use an even tighter
distance, the distance $\hat\delta_p$ defined in (2.5).

Lemma 2.20. Let $p \ge 1$, let $W$ be a normalized graphon over $[0,1]$, let $x_1, x_2, \dots \in [0,1]$ be
chosen i.i.d. uniformly at random, and let $\rho_n$ be a sequence of positive numbers such that $\rho_n \to 0$.
Given $n \ge 2$, let $Q_n$ be the $n \times n$ matrix with entries $\min\{1, \rho_n W(x_i, x_j)\}$, relabelled in such a way
that $x_1 < x_2 < \cdots < x_n$. Then a.s. $\|\frac{1}{\rho_n} W[Q_n] - W\|_p \to 0$, so in particular $\rho(Q_n)/\rho_n \to 1$ and

    $\hat\delta_p\left(\frac{1}{\rho_n} Q_n, W\right) \to 0.$

The lemma is proved in Appendix C. The next lemma is a quantitative version of Lemma 2.20 
for block models, and is also proved in Appendix C. 

Lemma 2.21. Let C be a positive real number, let k G (0,1), and let IT' be a block model with 
minimal class size at least k, represented as a graphon over [0,1]. ^ log n < C, then 

6piHn{W'),W') = Op(^^^l^ 

and if K = Kn is such that limsup ^ logn < C, then with probability one, there exists a random no 
such that for n > no, 

5p{Hn{W'), W') = o(^ 

Here the constants implicit in the big-$O$ and $O_p$ symbols depend only on $C$.

3. Least squares estimation 

In this section, we prove the following theorem, which shows that the least squares estimator is
consistent. To state the theorem, we define

    $\mathrm{tail}^{(p)}_{\rho}(W) = \|W - \min\{W, \rho^{-1}\}\|_p.$

These tail bounds are easy to estimate when $W$ is an $L^{p'}$ graphon for some $p' > p$, in which case
they decay as a power of $\rho$:

    $\mathrm{tail}^{(p)}_{\rho}(W) = \|(W - \rho^{-1}) 1_{W \ge \rho^{-1}}\|_p$
    $\qquad \le \|W 1_{W \ge \rho^{-1}}\|_p$
    $\qquad \le \|W (W\rho)^{p'/p - 1}\|_p$
    $\qquad = \|W\|_{p'}^{p'/p} \rho^{p'/p - 1}.$

When $W$ is an $L^{p'}$ graphon for $p' = p$ but not $p' > p$, tail bounds become more subtle, but it remains
the case that $\mathrm{tail}^{(p)}_{\rho}(W) \to 0$ as $\rho \to 0$.

Theorem 3.1. Let $W$ be an $L^2$ graphon, normalized so that $\|W\|_1 = 1$, and let $\hat{W} = (\hat{p}, \hat{B})$ be
the output of the least squares algorithm (1.3) for a W-random graph $G$ on $n$ vertices with target
density $\rho$.
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(i) If K e {n ^,1] and = 0{pn), then 


*I^W'.»')=Op|41 




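(ii) If $\kappa \in (0,1]$ is fixed and $\rho = \rho_n$ is such that $\rho_n \to 0$ and $n\rho_n \to \infty$, then

$$\delta_2\Bigl(\frac{1}{\rho}\hat W, W\Bigr) \to \varepsilon^{(2)}_{\ge\kappa}(W) \quad\text{with probability 1.}$$

(iii) If $\rho = \rho_n$ and $\kappa = \kappa_n$ are such that $\rho_n \to 0$, $n\rho_n \to \infty$, $\kappa_n \to 0$, and $\kappa_n^{-2}\log(1/\kappa_n) = o(n\rho_n)$
as $n \to \infty$, then

$$\delta_2\Bigl(\frac{1}{\rho}\hat W, W\Bigr) \to 0 \quad\text{with probability 1.}$$
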


Conceptually, the proof of Theorem 3.1 is based on the following two observations. First, for any
map $\pi\colon [n] \to [k]$ and any $k \times k$ matrix $B$,

$$\|A(G) - B^\pi\|_2^2 = \|A(G)\|_2^2 - 2\langle A(G), B^\pi\rangle + \|B^\pi\|_2^2.$$

Therefore, the argmin of the left side is the same as the argmax of $2\langle A(G), B^\pi\rangle - \|B^\pi\|_2^2$. Second,
conditioned on the weighted $W$-random graph $Q = Q_n(\rho_n W)$,

$$E\bigl[2\langle A(G), B^\pi\rangle - \|B^\pi\|_2^2\bigr] = 2\langle Q, B^\pi\rangle - \|B^\pi\|_2^2.$$

Up to errors stemming from imperfect concentration, we therefore expect that the argmin $(\hat B, \hat\pi)$
from (1.3) is a maximizer for $2\langle Q, B^\pi\rangle - \|B^\pi\|_2^2$, and hence a minimizer for $\|Q - B^\pi\|_2$. Thus, we
would expect that, again up to issues of concentration, the error is bounded by $\hat\varepsilon^{(2)}_{\ge\kappa}(Q)$, where
for an arbitrary $n \times n$ matrix $H$,

$$\hat\varepsilon^{(2)}_{\ge\kappa}(H) = \min_{B \in \mathcal{A}_{n,\ge\kappa}} \|H - B\|_2.$$
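
The expansion of the squared error, and the fact that for a fixed map $\pi$ the best block matrix is the block average of the adjacency matrix, can be checked on a toy example. This is a minimal sketch, not the authors' algorithm: it assumes the normalized norm $\|M\|_2^2 = n^{-2}\sum_{i,j} M_{ij}^2$, fixes $\pi$ rather than optimizing over it, and uses hypothetical helper names.

```python
import numpy as np

def block_average(A, labels, k):
    """k x k matrix of block averages of A, together with its n x n blow-up B^pi."""
    B = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            block = A[np.ix_(labels == a, labels == b)]
            if block.size:                      # skip empty classes
                B[a, b] = block.mean()
    return B, B[np.ix_(labels, labels)]

def sq_norm(M):
    """Normalized squared L^2 norm (1/n^2) * sum of squared entries (assumed convention)."""
    return float(np.mean(M ** 2))

def inner(M1, M2):
    """Normalized inner product (1/n^2) * sum of entrywise products."""
    return float(np.mean(M1 * M2))

rng = np.random.default_rng(0)
n, k = 30, 3
A = np.triu(rng.integers(0, 2, size=(n, n)), 1).astype(float)
A = A + A.T                                      # symmetric 0/1 adjacency matrix
labels = rng.integers(0, k, size=n)              # an arbitrary map pi: [n] -> [k]
B, B_pi = block_average(A, labels, k)
lhs = sq_norm(A - B_pi)
rhs = sq_norm(A) - 2 * inner(A, B_pi) + sq_norm(B_pi)
print(lhs, rhs)
```

Minimizing the left side over both $B$ and $\pi$ is the expensive part of the least squares algorithm; the sketch only covers the inner step for a given $\pi$.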

For bounded graphons, this strategy was implemented in [20], leading to (i) a proof of consistency
for all bounded graphons $W$ and (ii) a differentially private algorithm achieving the same goal under
slightly less general conditions (requiring $\rho n$ to grow at least like $\log n$). For the case of general $L^2$
graphons, the above motivation still lies behind our proof, but the actual implementation proceeds
along slightly different lines, and combines elements of the (sparse graph) strategy of [20] with
elements of the (dense graph) strategy developed in [37]. The resulting estimates are stated in
Theorem 3.2, which bounds the $L^2$ difference between the output of the algorithm (1.3) and the
matrix $Q$ in terms of $\hat\varepsilon^{(2)}_{\ge\kappa}(Q)$ and an error term representing errors from imperfect concentration.
To obtain Theorem 3.1 from Theorem 3.2, we will need to transform an estimate on the $L^2$ error
with respect to $Q$ into an $L^2$ error with respect to $W$, and we will want to express the result in
terms of $\varepsilon^{(2)}_{\ge\kappa}(W)$ instead of $\hat\varepsilon^{(2)}_{\ge\kappa}(Q)$. This leads to two extra error terms, the last two terms in the
bound of statement (i) in Theorem 3.1.

Before stating Theorem 3.2 formally, we recall that any block model $W \in \mathcal{B}_{n,\ge\kappa}$ can be represented
by an $n \times n$ matrix $M_n(W) \in \mathcal{A}_{n,\ge\kappa}$ such that $W$ and $M_n(W)$ are equivalent as graphons; see (2.7)
and the discussion preceding it.


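Theorem 3.2. Let $W$ be a normalized $L^2$ graphon, let $0 < \rho, \kappa \le 1$ and $n \in \mathbb{N}$, and let $G = G_n(\rho W)$,
$Q = Q_n(\rho W)$ and let $\hat W = (\hat p, \hat B)$ be the output of the least squares algorithm (1.3) with input $G$.
If $\kappa n \ge 1$ and $(1 + \log(1/\kappa))\kappa^{-2} = O(\rho n)$, then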


52 M, 


{w),q) 


^ + Op I 


'l + log(l/K) 
K^pn 
where the constant implicit in the $O_p$ symbol depends on the $L^2$ norm of $W$.

If $\rho = \rho_n$ is such that $n\rho_n \to \infty$ and $\rho_n \to 0$, then almost surely, for $n$ large enough and all $\kappa$
with $n\kappa \ge 1$ and $(1 + \log(1/\kappa))\kappa^{-2} = O(\rho n)$,

h{MJW),Q) < 

where again the constant implicit in the big-$O$ symbol depends on the $L^2$ norm of $W$.

Proof. Let M = A = A{G), andk=\^]. 

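As a first step, we will prove that

(3.1) $$\hat\delta_2(\hat M, Q) \le \hat\varepsilon^{(2)}_{\ge\kappa}(Q) + 2k\varepsilon + 2\sqrt{k\varepsilon\|Q\|_2} \quad\text{where}\quad \varepsilon = \max_{\pi\colon [n]\to[k]} \|A_\pi - Q_\pi\|_1.$$

To prove (3.1) we note that $\hat M = M_n(\hat W)$ is a minimizer of $\|A - M\|_2$ over all $M \in \mathcal{A}_{n,\ge\kappa}$. As a
consequence,

$$-2\langle A, \hat M\rangle + \|\hat M\|_2^2 \le -2\langle A, M\rangle + \|M\|_2^2$$

for all $M \in \mathcal{A}_{n,\ge\kappa}$, which in turn implies that

$$\begin{aligned}
\hat\delta_2(\hat M, Q)^2 &\le \|\hat M - Q\|_2^2 \\
&\le \|M\|_2^2 - 2\langle \hat M, Q\rangle + \|Q\|_2^2 + 2\langle \hat M - M, A\rangle \\
&= \|M - Q\|_2^2 + 2\langle \hat M - M, A - Q\rangle.
\end{aligned}$$

Since $\hat M, M \in \mathcal{A}_{n,\ge\kappa}$, we know that there are partitions $\hat\pi, \pi\colon [n] \to [k]$ such that $\hat M = \hat M_{\hat\pi}$, $M = M_\pi$,
and all non-empty classes of $\hat\pi$ and $\pi$ have size at least $\lfloor\kappa n\rfloor$. As a consequence

$$\begin{aligned}
|\langle M, A - Q\rangle| &= |\langle M, (A - Q)_\pi\rangle| \le \|M\|_\infty \|(A - Q)_\pi\|_1 \\
&\le \varepsilon\|M\|_\infty \le \varepsilon\,\frac{n}{\lfloor\kappa n\rfloor}\,\|M\|_2 \le k\varepsilon\|M\|_2,
\end{aligned}$$

where in the second to last step we used that $M$ is an $n \times n$ block matrix such that each block
contains at least $\lfloor\kappa n\rfloor^2$ elements. Bounding $|\langle \hat M, A - Q\rangle|$ in the same way, we find that

$$\hat\delta_2(\hat M, Q)^2 \le \|M - Q\|_2^2 + 2k\varepsilon\bigl(\|M\|_2 + \|\hat M\|_2\bigr).$$

Bounding $\|\hat M\|_2 = \hat\delta_2(0, \hat M) \le \|Q\|_2 + \hat\delta_2(\hat M, Q)$ and $\|M\|_2 \le \|Q\|_2 + \|M - Q\|_2$, a small calculation
then shows that

$$\bigl(\hat\delta_2(\hat M, Q) - k\varepsilon\bigr)^2 \le \bigl(\|M - Q\|_2 + k\varepsilon\bigr)^2 + 4k\varepsilon\|Q\|_2.$$

Choosing $M$ in such a way that $\hat\varepsilon^{(2)}_{\ge\kappa}(Q) = \|M - Q\|_2$, this proves (3.1).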
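
For readers who want the small calculation spelled out, here is one way to carry it out. The abbreviations $x$, $a$, and $q$ are ours and do not appear in the original argument.

```latex
% Abbreviations (ours): x = \hat\delta_2(\hat M, Q), a = \|M - Q\|_2, q = \|Q\|_2.
\begin{align*}
x^2 &\le a^2 + 2k\varepsilon\bigl(\|M\|_2 + \|\hat M\|_2\bigr)
      \le a^2 + 2k\varepsilon\,(2q + a + x), \\
x^2 - 2k\varepsilon x + k^2\varepsilon^2
    &\le a^2 + 2k\varepsilon a + k^2\varepsilon^2 + 4k\varepsilon q, \\
(x - k\varepsilon)^2 &\le (a + k\varepsilon)^2 + 4k\varepsilon q, \\
x &\le k\varepsilon + \sqrt{(a + k\varepsilon)^2 + 4k\varepsilon q}
   \le a + 2k\varepsilon + 2\sqrt{k\varepsilon q}.
\end{align*}
```

The last step uses $\sqrt{u + v} \le \sqrt{u} + \sqrt{v}$ for $u, v \ge 0$; with $a = \hat\varepsilon^{(2)}_{\ge\kappa}(Q)$ this is exactly the form of (3.1).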

For all $\pi\colon [n] \to [k]$, we have $E[A_\pi \mid Q] = Q_\pi$. Using this fact and a concentration argument, one
can show that conditioned on $Q$, with probability at least $1 - e^{-n}$,

(3.) e , 

whenever $\rho(Q)n \ge 1$; see Lemma B.2 in Appendix B. The lemma also gives a bound on the
expectation, implying in particular that conditioned on $Q$,


e = Or. 


IpiQ) 


1 -|- log k ^ k'^' 


n 
whether or not the condition $\rho(Q)n \ge 1$ holds.

Since $E[\rho(Q)] \le \rho\|W\|_1 = \rho$ and $E[\|Q\|_2^2] \le \rho^2\|W\|_2^2$, this proves that


-2 + Op p\ 


'k?{l + log fc) ^ A:^ 


pn 


pn^ 


with the constant implicit in the Op symbol depending on ||M^||2- To transform this bound into the 
bound in the statement of the theorem, we observe that for k = 1 , k = is equal to while 

for A < 1, the assumption uk > 1 implies that k = In either case, 


/c^(l + log k) 


= O 


and 


ui 




n 


= 0 


1 + log(l/K] 




1 + log(l/N) 


= O 


1 + log(l/N] 
K^n 


where in the last step we used that the assumption implies that = 0(1). 

Thus, 


2te + 4Pmh = O, 

= Op( p] 


4/1 + log(l/K) 




again because = 0(1). This completes the proof of the bound in probability. 

To prove the a.s. statement, we note that by Lemma 2.20, $\rho(Q_n)/\rho_n \to 1$, which together with
the hypothesis that $n\rho_n \to \infty$ implies that almost surely, $n\rho(Q_n) \ge 1$ holds for sufficiently large $n$,
which allows us to use the bound (3.2). By a simple union bound, this bound holds for all $k \le n$
with probability at least $1 - ne^{-n}$. Since the failure probability is summable, we conclude that there
exists a random $n_0$ (depending on $W$ and the sequence $\rho_n$, but not on $\kappa$ or $k$) such that the bound
(3.2) holds for all $n \ge n_0$ and all $k \le n$. Combined with the fact that by the law of large numbers
for [/-statistics (see Lemma C.l in Appendix C), ^ llH^lb a.s. as n —)■ 00, we obtain the 

almost sure statement of the theorem. □ 


Proof of Theorem 3.1. Let $(\Omega, \pi)$ be the probability space on which $W$ is defined, and let $Q =
Q_n(\rho W)$ as before. Defining $W_\rho = \min\{W, 1/\rho\}$, we will write $Q$ as $\rho H_n(W_\rho)$ and $\mathrm{tail}^{(2)}_\rho(W) =
\|W - W_\rho\|_2$.

By the triangle inequality and the fact that the $\hat\delta_2$ distance dominates the $\delta_2$ distance, we have

(3.3) $$\delta_2\Bigl(\frac{1}{\rho}\hat W, W\Bigr) = \delta_2\Bigl(\frac{1}{\rho}M_n(\hat W), W\Bigr) \le \hat\delta_2\Bigl(\frac{1}{\rho}M_n(\hat W), \frac{1}{\rho}Q\Bigr) + \delta_2\Bigl(\frac{1}{\rho}Q, W\Bigr).$$

To bound the first term on the right side, we will use Theorem 3.2 and then bound $\hat\varepsilon^{(2)}_{\ge\kappa}(\rho^{-1}Q)$ in
terms of $\varepsilon^{(2)}_{\ge\kappa}(W)$.

Recall that by Lemma 2.18 the infimum in the definition (1.2) of $\varepsilon^{(2)}_{\ge\kappa}(W)$ is a minimum, and the
minimizer $W' \in \mathcal{B}_{\ge\kappa}$ satisfies $\|W'\|_2 \le 2\|W\|_2$. As established in Lemma 2.19, we can relabel the
blocks of the block model $W'$ in such a way that

r4~ [16 

52{W'[\N[W']) < \ —WW'h < —\\W\\2 = \ —\\W\\2 

\ nn \ Kn \ nn 

for some W” G An,>K- Setting W' = W[kF'], we find that 


<h{-Q,W'] + \l^\\W \\2 

\p J \ Kn 

= S2(Hn{Wp),W') + J —\\W\\2. 

Next we would like to choose a coupling $\mu$ of $p$ and $\pi$ such that

$$\varepsilon^{(2)}_{\ge\kappa}(W) = \delta_2(W', W) = \|W' - W\|_{2,\mu},$$

where $\|\cdot\|_{2,\mu}$ denotes the $L^2$ norm with respect to the coupling $\mu$. (This is an abuse of notation, but
it is more convenient than writing out the formula, as in (1.1).) Such a coupling needn't exist, but
that is not a significant obstacle. We could complete the proof by looking at couplings that come
arbitrarily close to the oracle error, but instead we will switch to equivalent graphons over $[0,1]$,
because Lemma A.1 then guarantees the existence of an optimal coupling. The oracle error and
tail bounds are invariant under equivalence, so we can assume without loss of generality that the
coupling $\mu$ exists.

We use this coupling to couple the random graphs $Q(\rho W)$ and $Q(\rho W')$. With the help of the
triangle inequality, we then conclude that


(3.4) 


< \\Hn{Wp) - Hn{W)\\2 + \\Hn{W) - Hn{W')\\2 

V y \ Kn 


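After these preparations, we start with the proof of (i). To this end, we first use the triangle
inequality and the fact that $\delta_2(W', W) = \varepsilon^{(2)}_{\ge\kappa}(W)$ to bound

$$\begin{aligned}
\delta_2\Bigl(\frac{1}{\rho}Q, W\Bigr) &= \delta_2\bigl(H_n(W_\rho), W\bigr) \\
&\le \|H_n(W_\rho) - H_n(W)\|_2 + \|H_n(W) - H_n(W')\|_2 \\
&\quad + \delta_2\bigl(H_n(W'), W'\bigr) + \varepsilon^{(2)}_{\ge\kappa}(W).
\end{aligned}$$

Next we estimate

$$E\bigl[\|H_n(W_\rho) - H_n(W)\|_2\bigr] = E\bigl[\|H_n(W_\rho - W)\|_2\bigr] \le \sqrt{E\bigl[\|H_n(W_\rho - W)\|_2^2\bigr]} = \|W_\rho - W\|_2 = \mathrm{tail}^{(2)}_\rho(W)$$

and

$$E\bigl[\|H_n(W) - H_n(W')\|_2\bigr] \le \|W - W'\|_{2,\mu} = \varepsilon^{(2)}_{\ge\kappa}(W).$$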

Since ^2 (^Hn{W'), W'^ has the same distribution as S 2 (^Hn{W'), , we may then use Lemma 2.21 

and the fact that ||kL'||2 < 2||1T||2 to conclude that 


?( 2 ) 



+ 41{W) + ^ 1 ^ 
(Note that $(1 + \log(1/\kappa))\kappa^{-2} = O(\rho n)$ implies that $n^{-1/2} = O(\kappa)$ and hence $\log n = O(\kappa n)$, as

required for the application of Lemma 2.21.) In a similar way, we use the fact that 52 {Hn(W'), W') 
has the same distribution as S 2 {Hn{W'), W) to conclude that 

52 Qg, w'^ = Op |^tail(2)(II^) + eg(hL) + ^J^\\w\\2^ . 

With the help of (3.3) and Theorem 3.2, this implies that 

fe (W, w) = O, (taUf (WO + ef^(W) + ^11 WOb + . 

which concludes the proof of (i). 

Next we prove (ii). Since $W$ is square integrable, $\|W - W_\rho\|_2 \to 0$ as $\rho \to 0$, so by combining the
law of large numbers for $U$-statistics (see Lemma C.1 in Appendix C) with a simple two-$\varepsilon$ argument,
we conclude that a.s., the first term in (3.4) goes to zero. Again by the law of large numbers for
$U$-statistics, the second term goes to $\|W' - W\|_{2,\mu} = \varepsilon^{(2)}_{\ge\kappa}(W)$, and by Lemma 2.21 and the fact that
HniW') and Hn{W') have the same distribution, the third term goes to zero as well. Thus a.s., the 
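right side of (3.4) goes to $\varepsilon^{(2)}_{\ge\kappa}(W)$. Combined with (3.3), Lemma 2.20, and Theorem 3.2, we see
that for fixed $\kappa$,

$$\limsup_{n\to\infty} \delta_2\Bigl(\frac{1}{\rho}\hat W, W\Bigr) \le \varepsilon^{(2)}_{\ge\kappa}(W) \quad\text{with probability 1.}$$

On the other hand, by the second bound in Lemma 2.19,

$$\varepsilon^{(2)}_{\ge\kappa}(W) \le \liminf_{n\to\infty} \min_{W'' \in \mathcal{B}_{\ge\kappa,n}} \delta_2(W'', W).$$

Since $\rho^{-1}\hat W \in \mathcal{B}_{\ge\kappa,n}$, this gives $\varepsilon^{(2)}_{\ge\kappa}(W) \le \liminf_{n\to\infty} \delta_2(\rho^{-1}\hat W, W)$, completing the proof of (ii).

To prove (iii), note that the condition $\kappa_n^{-2}\log(1/\kappa_n) = o(n\rho_n)$ implies in particular that $\kappa_n\sqrt{n} \to
\infty$, which in turn implies that $\frac{1}{\kappa_n n}\log n \to 0$. We may therefore again use Lemma 2.21 to show that
the third term in (3.4) goes to zero a.s. The first term does not depend on $\kappa$, and hence goes to zero
just as before, but now the second term goes to zero as well, by a two-$\varepsilon$ argument invoking now the
fact that $\|W' - W\|_{2,\mu} = \varepsilon^{(2)}_{\ge\kappa_n}(W) \to 0$. Since the condition $\kappa_n^{-2}\log(1/\kappa_n) = o(n\rho_n)$ clearly implies
that $n\kappa_n \to \infty$, we conclude that a.s., $\hat\varepsilon^{(2)}_{\ge\kappa_n}(\rho_n^{-1}Q_n) \to 0$. Combined with (3.3), Lemma 2.20, and
Theorem 3.2, this implies (iii). □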


4. Cut norm estimation for general graphons 


In this section, we prove the following theorem, which shows that the least cut norm estimator is 
consistent. 


Theorem 4.1. Let $W$ be an $L^1$ graphon, normalized so that $\|W\|_1 = 1$, and let $\hat W = (\hat p, \hat B)$ be the
output of the least cut norm algorithm (1.5).

(i) If k£ 1], then



5n(^W,W) =Op\e 


.-( 1 ) 


(^) + \/^ + 

\ pn \ Kn 


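(ii) If $\kappa \in (0,1]$ is fixed and $\rho = \rho_n$ is such that $\rho_n \to 0$ and $n\rho_n \to \infty$, then

$$\limsup_{n\to\infty} \delta_\square\Bigl(\frac{1}{\rho}\hat W, W\Bigr) \le 2\varepsilon^{(1)}_{\ge\kappa}(W) \quad\text{with probability 1.}$$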
(in) If p = Pn and k = Kn are such that pn —)• 0, npn —)• oo, —)■ 0, and —)• 0, then

5u{^W , ^ D with probability 1. 

The proof relies again on a concentration argument, this time starting from the observation that
for all $S, T \subseteq [n]$,

(4.1) $$E\Bigl[\sum_{(x,y)\in S\times T} A_{xy}(G)\Bigr] = \sum_{(x,y)\in S\times T} Q_{x,y}.$$

Therefore, up to issues of concentration, minimizing the cut distance between $A(G)$ and a block
model in $\mathcal{B}_{\ge\kappa,n}$ is the same as minimizing the cut distance between $Q$ and a block model in $\mathcal{B}_{\ge\kappa,n}$.
In other words, up to issues of concentration, one might hope that the distance between $Q$ and the
output $\hat W$ of the algorithm (1.5) is just $\hat\varepsilon_{\ge\kappa,\square}(Q)$, where for an arbitrary $n \times n$ matrix $H$,

$$\hat\varepsilon_{\ge\kappa,\square}(H) = \min_{M \in \mathcal{A}_{n,\ge\kappa}} \|H - M\|_\square.$$

It turns out that we lose a factor of two with respect to this optimum, due to the fact that in (1.5),
we optimize over all block matrices of the form $A(G)_\pi$, rather than all block matrices that are
constant on the blocks determined by $\pi$. While these two minimizations are equivalent in the least
squares case, they are not here, leading to the loss of a factor of two.^9

The following theorem states our approximation guarantees with respect to $Q$. Theorem 4.1
follows from it in essentially the same way as Theorem 3.1 follows from Theorem 3.2. To state it,
we recall the definition (2.2) of the distance $\hat\delta_\square$.

Theorem 4.2. Let $W$ be a normalized graphon, let $0 < \rho \le 1$ and $n \in \mathbb{N}$, and let $G = G_n(\rho W)$
and $Q = Q_n(\rho W)$. If $\kappa \in (n^{-1}, 1]$ and $\hat W = (\hat p, \hat B)$ is the output of the least cut norm algorithm
(1.5) with input $G$, then

h(Mn{W),Q) <‘2e>yu{Q) + Oy(^p.j^. 

If $\rho = \rho_n$ is such that $n\rho_n \to \infty$ and $\rho_n \to 0$, then almost surely, for $n$ large enough and all
$\kappa \in (n^{-1}, 1]$,

h{Mn{W),Q) < 2e>,,n(g) + o(^py^^. 

Proof. Let A = A{G) and k = We will show that 

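(4.2) $$\hat\delta_\square\bigl(M_n(\hat W), Q\bigr) \le 2\hat\varepsilon_{\ge\kappa,\square}(Q) + 3\|Q - A\|_\square.$$

To this end, we first prove that

(4.3) $$\|M_n(\hat W) - A\|_\square \le 2\min_{M \in \mathcal{A}_{n,\ge\kappa}} \|M - A\|_\square.$$

To see this, we note that $\mathcal{A}_{n,\ge\kappa}$ consists of all $n \times n$ matrices $M$ such that $M = M_\pi$ for some
$\pi\colon [n] \to [k]$ such that the smallest non-empty class of $\pi$ has at least size $\lfloor\kappa n\rfloor$. Next we observe
that for all $\pi\colon [n] \to [k]$, the map $H \mapsto H_\pi$ is a contraction in the cut norm. As a consequence, for
all $n \times n$ matrices $M$ with $M = M_\pi$,

$$\|A_\pi - A\|_\square \le \|A_\pi - M_\pi\|_\square + \|M - A\|_\square \le 2\|M - A\|_\square.$$

Because $M_n(\hat W) = A_{\hat\pi}$ for some $\hat\pi\colon [n] \to [k]$ that minimizes $\|A - A_\pi\|_\square$ over all $\pi$ whose smallest
non-empty class has size at least $\lfloor\kappa n\rfloor$, the bound (4.3) now follows.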
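
The two facts used for (4.3), namely that block averaging is a contraction in the cut norm and that this costs at most a factor of two, can be checked by brute force on very small matrices. The sketch below is illustrative only: it assumes the normalization $\|H\|_\square = n^{-2}\max_{S,T}|\sum_{(i,j)\in S\times T} H_{ij}|$, enumerates all $2^n$ row sets (so it is feasible only for tiny $n$), and uses helper names of our own.

```python
import itertools
import numpy as np

def cut_norm(H):
    """Brute-force cut norm (1/n^2) * max over S, T of |sum_{S x T} H|; exponential in n."""
    n = H.shape[0]
    best = 0.0
    for mask in itertools.product((0, 1), repeat=n):
        col = np.asarray(mask) @ H               # column sums over the rows in S
        # For fixed S the best T collects all positive (or all negative) column sums.
        best = max(best, col[col > 0].sum(), -col[col < 0].sum())
    return best / n ** 2

def block_average(H, labels):
    """Replace every block of H (with respect to the partition `labels`) by its average."""
    out = np.empty_like(H, dtype=float)
    classes = np.unique(labels)
    for a in classes:
        for b in classes:
            idx = np.ix_(labels == a, labels == b)
            out[idx] = H[idx].mean()
    return out

rng = np.random.default_rng(0)
n = 6
A = np.triu(rng.integers(0, 2, size=(n, n)), 1).astype(float)
A = A + A.T                                       # toy adjacency matrix
labels = np.array([0, 0, 0, 1, 1, 1])             # a partition pi with two classes
A_pi = block_average(A, labels)
M = block_average(rng.random((n, n)), labels)     # some matrix that is constant on the blocks of pi
print(cut_norm(A_pi), cut_norm(A))                # contraction
print(cut_norm(A_pi - A), 2 * cut_norm(M - A))    # factor-two bound
```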

^9 At the cost of an even slower algorithm, this could be cured by redefining the algorithm (1.5) to optimize over all
block matrices that are constant on the blocks determined by $\pi$.
After this preparation, the proof of (4.2) is straightforward. Indeed,

$$\begin{aligned}
\hat\delta_\square\bigl(M_n(\hat W), Q\bigr) &\le \|M_n(\hat W) - A\|_\square + \|A - Q\|_\square \\
&\le 2\min_{M \in \mathcal{A}_{n,\ge\kappa}} \|M - A\|_\square + \|A - Q\|_\square \\
&\le 2\min_{M \in \mathcal{A}_{n,\ge\kappa}} \|M - Q\|_\square + 3\|A - Q\|_\square \\
&= 2\hat\varepsilon_{\ge\kappa,\square}(Q) + 3\|Q - A\|_\square.
\end{aligned}$$

From here on, the proof proceeds along the same lines as that of Theorem 3.2, this time starting
from the observation (4.1). Using this fact and a concentration argument, we now can show that
conditioned on $Q$, if $\rho(Q)n \ge 1$ then

||g-^||n < 15 

holds with probability at least 1 — e“"^, and 



\\Q - A\\a 



independently of the condition $\rho(Q)n \ge 1$; see Lemma B.3 in Appendix B for details. The assertions
of the theorem now follow. □


Proof of Theorem 4-1. Keeping the notation from the proof of Theorem 3.1, and using the fact that 
the distance dn is dominated by the distance di, we now bound 

(4.4) fa QlU, < fa (^Mn + 5i Qq, W 

Using Lemma 2.18 and Lemma 2.19 for p = 1, we now bound 

+ — = Si(h4Wp),W') + —, 

~ \p J \p J nn V / KTi 

where W' is a minimizer for (1.2) for p = 1, with ||IU'||i < 2||IU||i = 2, and W' again stands for 
W[IU']. Writing e>^(IU) as e>^(IU) = 8i{W', W) = ||IU' — for some coupling of p and vr 

(which we can assume exists without loss of generality by passing to equivalent graphons over [0,1], 
as in the proof of Theorem 3.1), we then get 

e>n,u{^Q^ < \\Hn{Wp)-Hn{W)\\i + \\Hn{W)-Hn{W')\\i 

+ UHn{W'),W)+ — 

\ J nn 

and 

<5i(^Q,Ik) < \\Hn{Wp)-Hn{W)\\i + \\Hn{W)-Hn{W')\\i 

+ <5i(iL„(IU'),IU') + eg(IU), 

where as before $H_n(W)$ and $H_n(W')$ are coupled with the help of $\mu$. From here on the proof of
Theorem 4.1 proceeds exactly as the proof of Theorem 3.1, with the condition $\frac{1}{\kappa n}\log n = O(1)$ that
is needed to apply Lemma 2.21 guaranteed by the hypotheses of the theorem. We finally arrive at


(Iq) = O, CtailW(H') + E™ (VF) + 
and

Qq, if) = . 

With the help of (4.4) and Theorem 4.2, this gives the bound in probability. 

The almost sure statements are proved similarly. □ 


5. Graphon estimation via degree sorting


In this section we analyze the behavior of the degree sorting algorithm described in the introduction. 
We will use the notation from Section 2.6 for degrees and the degree distribution. 

Theorem 5.1. Let $W$ be a graphon whose degree distribution function $D_W\colon [0,\infty) \to [0,1]$ is
continuous, let $G_n$ be a $W$-random graph on $n$ vertices with target density $\rho_n$, and let $\hat W_n$ be the
output of the degree sorting algorithm with $k_n$ parts and input $G_n$.

Suppose $\rho_n \to 0$, $n\rho_n \to \infty$, $k_n \to \infty$, $\log k_n = o(n\rho_n)$, and $k_n = o(n\sqrt{\rho_n})$ as $n \to \infty$. Then
$\rho_n^{-1}\hat W_n$ converges a.s. to $W$ under $\delta_1$.

Note that $D_W$ is continuous if and only if the degree distribution of $W$ is atomless. Graphons
with this property have a useful characterization as graphons over $[0,1]$:

Lemma 5.2. The degree distribution function $D_W$ of a graphon $W$ is continuous if and only if $W$
is equivalent to a graphon $U$ over $[0,1]$ whose degrees $U_x$ are strictly decreasing in $x$.

Proof. Every graphon $W$ is equivalent to a graphon $U$ over $[0,1]$, and via monotone rearrangement
we can furthermore assume that $U_x$ is weakly decreasing in $x$ (see [59] for a thorough discussion of
the measure-theoretic technicalities). Then $D_U = D_W$, while $D_U$ is continuous if and only if $U_x$ is
strictly decreasing. □


If $W$ is a graphon over $(\Omega, \mathcal{F}, \pi)$ and $\mathcal{P}$ is a partition of $\Omega$ into finitely many measurable pieces,
then $W_{\mathcal{P}}$ denotes the step function defined by

$$W_{\mathcal{P}}(x, y) = \frac{1}{\pi(I)\pi(J)} \int_{I \times J} W(u, v)\, d\pi(u)\, d\pi(v)$$

whenever $x$ is in the part $I$ of $\mathcal{P}$ and $y$ is in the part $J$. (This is not well defined for parts of measure
zero, but they can be ignored.) We will need the following sufficient condition for when averaging
over partitions converges under the $L^1$ norm.


Lemma 5.3. Let $W$ be an $L^1$ graphon over $[0,1]$, and let $\mathcal{P}_1, \mathcal{P}_2, \ldots$ be partitions of $[0,1]$ into
finitely many measurable pieces. Let $p_{n,\varepsilon}$ be the probability that independent random elements
$x, y \in [0,1]$ satisfy $|x - y| > \varepsilon$, conditioned on $x$ and $y$ lying in the same part of $\mathcal{P}_n$. If

$$\lim_{n\to\infty} p_{n,\varepsilon} = 0$$

for each $\varepsilon > 0$, then

$$\lim_{n\to\infty} \|W_{\mathcal{P}_n} - W\|_1 = 0.$$



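Proof. Without loss of generality we can assume that $W$ is continuous, because continuous functions
are dense in $L^1$ and $\|W_{\mathcal{P}_n} - W'_{\mathcal{P}_n}\|_1 \le \|W - W'\|_1$.

Let $J_1, \ldots, J_N$ be the parts of $\mathcal{P}_n$. Then for $(x, y) \in J_i \times J_j$,

$$W_{\mathcal{P}_n}(x, y) = \frac{1}{\lambda(J_i)\lambda(J_j)} \int_{J_i \times J_j} W(u, v)\, du\, dv.$$

By combining this formula with

$$W(x, y) = \frac{1}{\lambda(J_i)\lambda(J_j)} \int_{J_i \times J_j} W(x, y)\, du\, dv,$$

we find that

$$\|W_{\mathcal{P}_n} - W\|_1 \le \sum_{i,j=1}^N \frac{1}{\lambda(J_i)\lambda(J_j)} \int_{J_i \times J_j} \int_{J_i \times J_j} |W(u, v) - W(x, y)|\, du\, dv\, dx\, dy.$$

Because $W$ is continuous on $[0,1]^2$ (and hence uniformly continuous), for each $\delta > 0$, there exists
$\varepsilon > 0$ such that $|W(x, y) - W(u, v)| < \delta$ whenever $|x - u| \le \varepsilon$ and $|y - v| \le \varepsilon$. Then

$$\begin{aligned}
\|W_{\mathcal{P}_n} - W\|_1 &\le \delta + \sum_{i,j=1}^N \frac{2\|W\|_\infty}{\lambda(J_i)\lambda(J_j)} \int_{J_i \times J_j} \int_{J_i \times J_j} 1_{|x - u| > \varepsilon \text{ or } |y - v| > \varepsilon}\, du\, dv\, dx\, dy \\
&\le \delta + \sum_{i,j=1}^N \frac{4\|W\|_\infty}{\lambda(J_i)\lambda(J_j)} \int_{J_i \times J_j} \int_{J_i \times J_j} 1_{|x - u| > \varepsilon}\, du\, dv\, dx\, dy \\
&= \delta + 4\|W\|_\infty \sum_{i=1}^N \frac{1}{\lambda(J_i)} \int_{J_i \times J_i} 1_{|x - u| > \varepsilon}\, du\, dx \\
&= \delta + 4\|W\|_\infty\, p_{n,\varepsilon}.
\end{aligned}$$

It follows that

$$\limsup_{n\to\infty} \|W_{\mathcal{P}_n} - W\|_1 \le \delta$$

for each $\delta > 0$, as desired. □
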
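
The conclusion of Lemma 5.3 can be observed numerically in the simplest setting, where the parts of $\mathcal{P}_n$ are $k$ equal intervals (so that points in the same part are at distance at most $1/k$). The following sketch discretizes a continuous graphon on a grid; the choice $W(x,y) = e^{-(x+y)}$, the grid size, and the function name are illustrative assumptions of ours.

```python
import numpy as np

def step_l1_error(W, k):
    """L^1 distance between a grid-sampled W and its average over k equal intervals."""
    m = W.shape[0]
    assert m % k == 0, "grid size must be a multiple of the number of parts"
    b = m // k
    blocks = W.reshape(k, b, k, b).mean(axis=(1, 3))           # k x k block averages
    step = np.repeat(np.repeat(blocks, b, axis=0), b, axis=1)   # blow the averages back up
    return float(np.mean(np.abs(step - W)))

m = 240
x = (np.arange(m) + 0.5) / m
W = np.exp(-np.add.outer(x, x))     # hypothetical continuous graphon sampled at grid midpoints
errors = {k: step_l1_error(W, k) for k in (2, 4, 8, 16)}
print(errors)                        # the error shrinks as the partition is refined
```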


Proof of Theorem 5.1. By Lemma 5.2, we can assume that $W$ is a graphon over $[0,1]$ for which the
degrees $W_x$ are strictly decreasing in $x$.

Let $I_{i,n} = [(i-1)/n, i/n)$ so that $I_{1,n}, I_{2,n}, \ldots, I_{n,n}$ form a partition of $[0,1]$ (up to the measure-zero
set of their endpoints, which we will ignore). We will assume the vertices of $G_n$ are ordered
so that the corresponding sample points in $[0,1]$ satisfy $x_1 < x_2 < \cdots < x_n$, and we view $G_n$ as a
graphon over $[0,1]$ via the blocks $I_{i,n}$ and this vertex ordering.

Let $d_1, \ldots, d_n$ be the vertex degrees, and set $\bar d = (d_1 + \cdots + d_n)/n$. Recall that the degree sorting
algorithm works as follows. We choose a permutation $\sigma$ of $[n]$ such that

$$d_{\sigma(1)} \ge d_{\sigma(2)} \ge \cdots \ge d_{\sigma(n)}$$

and integers $0 = n_0 < n_1 < \cdots < n_k = n$ such that

$$\Bigl| n_i - \frac{in}{k} \Bigr| < 1.$$

Then we define $\hat\pi\colon [n] \to [k]$ by $\hat\pi(j) = i$ if $n_{i-1} < \sigma(j) \le n_i$. The output of the algorithm is the
block model $\hat W = (\hat p, \hat B)$ with $\hat p_i = 1/k$ and $\hat B = A(G)/\hat\pi$.
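
The degree sorting procedure recalled above is simple enough to sketch in a few lines. This is a minimal illustration under assumptions of ours, not the authors' code: ties between equal degrees are broken by a stable sort, the cut points $n_i$ are obtained by rounding $in/k$, we require $k \le n$, the class of a vertex is determined by its rank in the degree ordering, and $\hat B$ is taken to be the matrix of block averages of the adjacency matrix.

```python
import numpy as np

def degree_sort_estimate(A, k):
    """Partition vertices into k consecutive groups by decreasing degree and average A over blocks."""
    n = A.shape[0]
    assert k <= n, "need at most one part per vertex"
    degrees = A.sum(axis=1)
    order = np.argsort(-degrees, kind="stable")            # vertices listed by non-increasing degree
    cuts = np.round(np.arange(k + 1) * n / k).astype(int)   # n_i with |n_i - i*n/k| < 1
    labels = np.empty(n, dtype=int)
    for i in range(k):
        labels[order[cuts[i]:cuts[i + 1]]] = i              # class i holds ranks n_i + 1, ..., n_{i+1}
    B = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            B[a, b] = A[np.ix_(labels == a, labels == b)].mean()
    return labels, B

rng = np.random.default_rng(0)
n = 60
x = np.sort(rng.uniform(size=n))[::-1]                      # hypothetical latent positions
P = 0.8 * np.outer(x, x)                                    # hypothetical edge probabilities
A = np.triu(rng.uniform(size=(n, n)) < P, 1).astype(float)
A = A + A.T
labels, B = degree_sort_estimate(A, 4)
print(B.round(2))
```

Sorting dominates the running time, which is why this estimator, unlike the least squares and least cut norm estimators, runs in polynomial time.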

Let $V_1, \ldots, V_k$ be the preimages of $1, \ldots, k$ under $\hat\pi$, and set

$$J_i = \bigcup_{j \in V_i} I_{j,n}.$$

Then $J_1, \ldots, J_k$ form a partition $\mathcal{P}_n$ of $[0,1]$, and $\hat W_n$ is equivalent to $(G_n)_{\mathcal{P}_n}$. (Recall that we view
$G_n$ as a graphon over $[0,1]$.) We wish to prove that

$$\delta_1\bigl(\rho_n^{-1}(G_n)_{\mathcal{P}_n}, W\bigr) \to 0.$$

In fact, we will prove that $\|\rho_n^{-1}(G_n)_{\mathcal{P}_n} - W\|_1 \to 0$, given our ordering of the vertices of $G_n$.

We will use the notation established in previous sections, such as $Q_n$ for the weighted random
graph used to generate $G_n$. Recall from Lemma 2.20 that a.s. $\rho(Q_n)/\rho_n \to 1$ and $\|\rho_n^{-1}Q_n - W\|_1 \to 0$.
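We begin with the inequality

$$\|\rho_n^{-1}(G_n)_{\mathcal{P}_n} - W\|_1 \le \|\rho_n^{-1}(G_n)_{\mathcal{P}_n} - \rho_n^{-1}(Q_n)_{\mathcal{P}_n}\|_1 + \|\rho_n^{-1}(Q_n)_{\mathcal{P}_n} - \rho_n^{-1}Q_n\|_1 + \|\rho_n^{-1}Q_n - W\|_1.$$

The third term on the right tends to zero a.s. For the first term, we have

$$\|\rho_n^{-1}(G_n)_{\mathcal{P}_n} - \rho_n^{-1}(Q_n)_{\mathcal{P}_n}\|_1 = \rho_n^{-1}\|(G_n)_{\mathcal{P}_n} - (Q_n)_{\mathcal{P}_n}\|_1.$$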

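By Lemma B.2 and the fact that $\rho(Q_n)/\rho_n \to 1$ a.s., we can bound $\|(G_n)_{\mathcal{P}_n} - (Q_n)_{\mathcal{P}_n}\|_1$ by
$O\Bigl(\sqrt{\rho_n\Bigl(\frac{1 + \log k_n}{n} + \frac{k_n^2}{n^2}\Bigr)}\Bigr)$ a.s., and thus the hypotheses that $\log k_n = o(n\rho_n)$ and $k_n = o(n\sqrt{\rho_n})$ imply
that

$$\|\rho_n^{-1}(G_n)_{\mathcal{P}_n} - \rho_n^{-1}(Q_n)_{\mathcal{P}_n}\|_1 \to 0.$$
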

All that remains is to handle the second term, namely $\|\rho_n^{-1}(Q_n)_{\mathcal{P}_n} - \rho_n^{-1}Q_n\|_1$. Because $\|\rho_n^{-1}Q_n -
W\|_1 \to 0$, it will suffice to show that $\|W_{\mathcal{P}_n} - W\|_1 \to 0$. We will do so using Lemma 5.3.

Fix $\varepsilon > 0$, and let $p_{n,\varepsilon}$ be the probability that independent random elements $x, y \in [0,1]$ satisfy
$|x - y| > \varepsilon$, conditioned on $x$ and $y$ lying in the same part of $\mathcal{P}_n$. By contrast, let $p'_{n,\varepsilon}$ be the
probability that $|x - y| > \varepsilon$ and both points lie in the same part of $\mathcal{P}_n$, without the conditioning.
Because each part $J_i$ of $\mathcal{P}_n$ satisfies $\lambda(J_i) = (1 + o(1))/k_n$, proving that $p_{n,\varepsilon} \to 0$ is equivalent to
proving that $k_n p'_{n,\varepsilon} \to 0$. Thus, to apply Lemma 5.3, we must show that $k_n p'_{n,\varepsilon} \to 0$.


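Instead of analyzing the points $x$ and $y$, it will be convenient to consider the intervals $I_{\ell,n}$ and
$I_{m,n}$ containing them. We will use the bound

(5.1) $$\begin{aligned}
p'_{n,\varepsilon} &\le \Pr_{\ell,m\in[n]}\bigl(\hat\pi(\ell) = \hat\pi(m) \text{ and } \max\{|x - y| : x \in I_{\ell,n},\, y \in I_{m,n}\} > \varepsilon\bigr) \\
&= \Pr_{\ell,m\in[n]}\bigl(\hat\pi(\ell) = \hat\pi(m) \text{ and } |\ell/n - m/n| > \varepsilon - 1/n\bigr),
\end{aligned}$$

where of course $\Pr_{\ell,m\in[n]}$ denotes the probability if $\ell$ and $m$ are chosen uniformly at random from
$[n]$.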

To analyze these probabilities, we need to bound how close the degrees in $G_n$ are to those in $W$.
Lemma 2.17 will provide suitable bounds. To apply this lemma, we must quantify how quickly the
degrees in $W$ change as a function of distance. Let

$$\delta = \inf_{|x - y| \ge \varepsilon/4 - 1/n} |W_x - W_y|.$$

Because $x \mapsto W_x$ is strictly decreasing, $\delta > 0$. Call an element $i \in [n]$ good if the normalized degree
$d_i/\bar d$ is within $\delta/3$ of $W_x$ for some $x \in I_{i,n}$. Taking $U = \rho_n^{-1}G_n$ in Lemma 2.17 shows that the
fraction of bad elements is at most

^lIPn^Gn - lT||n, 

which tends to zero as $n \to \infty$. If $i$ and $j$ are good and $|i/n - j/n| \ge \varepsilon/4$, then

$$\Bigl| \frac{d_i}{\bar d} - \frac{d_j}{\bar d} \Bigr| \ge \delta/3.$$

It follows that if $i$ and $j$ are good and $|i/n - j/n| \ge 3\varepsilon/4$, then at least the middle $\lfloor n\varepsilon/4\rfloor$ vertices
between $i$ and $j$ have degrees strictly between $d_i$ and $d_j$. When $n$ is large enough, this is much
larger than the number of vertices in any part of $\mathcal{P}_n$. In particular, if $n$ is large enough then good $i$
and $j$ with $|i/n - j/n| \ge 3\varepsilon/4$ cannot possibly end up in the same part after the degrees are sorted.
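Thus, by (5.1),

$$\begin{aligned}
p'_{n,\varepsilon} &\le \Pr_{\ell,m\in[n]}\bigl(\ell \text{ or } m \text{ is bad and } \hat\pi(\ell) = \hat\pi(m)\bigr) \\
&\le 2\Pr_{\ell,m\in[n]}\bigl(\ell \text{ is bad and } \hat\pi(\ell) = \hat\pi(m)\bigr) \\
&\le 2\Pr_{\ell\in[n]}(\ell \text{ is bad}) \max_i \lambda(J_i).
\end{aligned}$$

It now follows from $\|\rho_n^{-1}G_n - W\|_\square \to 0$ that $k_n p'_{n,\varepsilon} \to 0$, as desired. □

6. Hölder-continuous graphons
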


In this section, we analyze the least squares and the least cut norm algorithms for the case of
Hölder-continuous graphons. As discussed in the introduction, our approach allows us to reduce
this to the analysis of the two error terms $\mathrm{tail}^{(p)}_\rho(W)$ and $\varepsilon^{(p)}_{\ge\kappa}(W)$ for $p = 2$ and $p = 1$, respectively,
which reduces the analysis to pure approximation theory.

Throughout this section, we consider graphons $W$ over $\mathbb{R}^d$ (equipped with the standard Borel
$\sigma$-algebra and some probability measure $\pi$) that are $\alpha$-Hölder-continuous for some $\alpha \in (0,1]$, i.e.,
graphons $W$ for which there exists a constant $C$ such that

$$|W(x, y) - W(x', y)| \le C|x - x'|_\infty^\alpha \quad\text{for all } x, x', y \in \mathbb{R}^d$$

with $|\cdot|_\infty$ denoting the $L^\infty$ distance on $\mathbb{R}^d$ (note that we only require this for one of the two
coordinates of $W$, since for the other one it follows from the fact that $W$ is symmetric). We denote
the set of graphons obeying this bound by $\mathcal{H}_{C,\alpha}$. If we restrict ourselves to graphons on a subset $\Lambda$
of $\mathbb{R}^d$ we use the notation $\mathcal{H}_{C,\alpha}(\Lambda)$.

Our first proposition concerns the case when the support of the underlying measure $\pi$ is compact,
in which case we may assume without loss of generality that $\pi$ is a measure on $\Lambda_R = [-R, R]^d$ for
some $R \in [0,\infty)$. Note that many examples of $W$-random graphs considered in the statistics and
machine learning literature fit into this setting, e.g., the mixed membership block model of [4]. Note
also that while these models can be mapped onto $W$-random graphs over $[0,1]$ with the uniform
distribution by a measure-preserving map, such a map will typically not do this in a continuous
way. So if one wants to use continuity properties of the generating graphon $W$, one has to analyze
it on the original space on which it was defined, not on $[0,1]$.

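Proposition 6.1. Let $d \ge 1$, $R \in [0,\infty)$, $\alpha \in (0,1]$, and $C < \infty$, let $\pi$ be a probability measure
on $\Lambda_R \subset \mathbb{R}^d$ and let $W$ be a normalized graphon in $\mathcal{H}_{C,\alpha}(\Lambda_R)$. Then there exists a constant $D$
depending only on $R$, $C$, and $\alpha$ such that the following holds:

(i) We have $\|W\|_\infty \le D$. So in particular

$$\mathrm{tail}^{(p)}_\rho(W) = 0 \quad\text{if } \rho \le \frac{1}{D}.$$

(ii) For $p \ge 1$ and $\kappa > 0$,

(6.1) $$\varepsilon^{(p)}_{\ge\kappa}(W) \le 4D\kappa^{\alpha'},$$

where $\alpha' = \frac{\alpha}{p\alpha + d}$. If $\pi$ is the uniform measure, then the bound (6.1) holds for $\alpha' = \alpha/d$.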

Proof. We will prove the proposition for $D = 1 + 2C(2R)^\alpha$.

To prove the first statement, let $C_0 = \min_{x,y} W(x, y)$. Since $C_0 = \int C_0 \le \|W\|_1 = 1$, Hölder
continuity implies that $\|W\|_\infty \le 1 + 2C(2R)^\alpha = D$.

To prove the second statement, consider $k \in \mathbb{N}$, and let $\mathcal{P}$ be the partition of $\Lambda_R$ into cubes of
side-length $a = 2R/k$. For a given class $Y \in \mathcal{P}$, two points $x, x' \in Y$ have distance $|x - x'|_\infty \le a$.
Thus, if $Y$ and $Y'$ are two classes in $\mathcal{P}$, then $|W(x, y) - W(x', y')| \le 2Ca^\alpha = 2C(2Rk^{-1})^\alpha$ whenever
$x, x' \in Y$ and $y, y' \in Y'$. As a consequence $\|W - W_{\mathcal{P}}\|_\infty \le 2C(2Rk^{-1})^\alpha \le Dk^{-\alpha}$.

If TT is the uniform measure over A/j, then each class Y oiV has measure 7r(y) = k~^, so setting 
k = we obtain k~^ < and thus 

which proves the proposition for the case of the uniform measure. (Recall that 5p and hence etliW) 
are decreasing functions of p.) 

But for general measures, some of the classes of $\mathcal{P}$ might have tiny measure. To fix this, we
merge all classes of measure less than $\kappa$ (where $\kappa$ will now be smaller than $k^{-d}$) with the smallest
of those which have measure at least $\kappa$. Lemma 6.2 below will show that for $\kappa$ small enough, this
actually works. To apply the lemma, we set $N = k^d$ and observe that $\|W_{\mathcal{P}}\|_\infty \le \|W\|_\infty \le D$ and
$\|W - W_{\mathcal{P}}\|_\infty \le DN^{-\alpha/d}$. Lemma 6.2 then implies that

$$\varepsilon^{(p)}_{\ge\kappa}(W) \le 2DN^{-\alpha/d}$$

provided $2\kappa \le N^{-(p\alpha + d)/d}$. Thus for $\kappa \le 1/2$, we may choose $k = \lfloor (2\kappa)^{-1/(p\alpha + d)} \rfloor$ to show that (6.1)
holds for $\kappa \le 1/2$. For $\kappa > 1/2$, that would amount to $k = 0$, but fortunately this case is trivial: the
right side of (6.1) is at least $2D$ and hence at least 2, while $\varepsilon^{(p)}_{\ge\kappa}(W) \le 1$ for a normalized graphon,
showing that (6.1) holds for $\kappa > 1/2$ as well. □
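
The first step of this proof, the uniform bound $\|W - W_{\mathcal{P}}\|_\infty \le 2C(2R/k)^\alpha$ for averaging over cubes of side length $2R/k$, can be checked numerically in dimension $d = 1$. The sketch below is illustrative only: the function $W(x,y) = C(|x|^\alpha + |y|^\alpha)$ is a hypothetical $\alpha$-Hölder example with constant $C$ (it is not normalized), block averages are approximated on a grid, and the function name is ours.

```python
import numpy as np

def sup_error_on_cubes(alpha, k, R=1.0, C=1.0, pts_per_cell=20):
    """Sup distance between a Hölder function on [-R, R]^2 and its averages over k x k cells,
    returned together with the bound 2 * C * (2R / k) ** alpha."""
    m = k * pts_per_cell
    x = -R + (np.arange(m) + 0.5) * (2 * R / m)             # grid midpoints in [-R, R]
    W = C * (np.abs(x)[:, None] ** alpha + np.abs(x)[None, :] ** alpha)
    b = pts_per_cell
    blocks = W.reshape(k, b, k, b).mean(axis=(1, 3))         # averages over cells of side 2R / k
    step = np.repeat(np.repeat(blocks, b, axis=0), b, axis=1)
    return float(np.max(np.abs(W - step))), 2 * C * (2 * R / k) ** alpha

for k in (2, 4, 8, 16):
    print(k, sup_error_on_cubes(0.5, k))
```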


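Lemma 6.2. Let $W$ be a bounded graphon over some probability space $(\Omega, \mathcal{F}, \pi)$, and let $W'$ be a
graphon over $(\Omega, \pi)$ such that $\|W - W'\|_p \le \varepsilon$ and $W'$ is a block model with $N$ classes. Then

$$\varepsilon^{(p)}_{\ge\kappa}(W) \le 2\varepsilon \quad\text{whenever}\quad \kappa \le \frac{1}{2N}\Bigl(\frac{\varepsilon}{\|W'\|_\infty}\Bigr)^p.$$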

Proof. Suppose $W'$ is based on the partition $(Y_1, \ldots, Y_N)$ of $\Omega$. Arranging the classes $Y_i$ in order
of decreasing measure, let $Y_\ell$ be the last class of measure $\kappa$ or more. We then define $Y'_\ell = \bigcup_{i \ge \ell} Y_i$
and $Y'_i = Y_i$ for all $i < \ell$. Let $W''$ be a block model with blocks $Y'_1, \ldots, Y'_\ell$ and the same values as
$W'$ on $Y_i \times Y_j$ when $i, j < \ell$ but the value 0 when $i$ or $j$ equals $\ell$. Clearly $W'' \in \mathcal{B}_{\ge\kappa}$. To prove the
lemma, we will have to show that $\|W' - W''\|_p \le \varepsilon$. To this end, we note that $W''$ and $W'$
agree on $\Omega_0 \times \Omega_0$, where $\Omega_0 = Y_1 \cup \cdots \cup Y_{\ell-1}$, and that $\|W' - W''\|_\infty \le \|W'\|_\infty$. As a consequence,

$$\|W' - W''\|_p = \|(W' - W'')(1 - 1_{\Omega_0 \times \Omega_0})\|_p \le \|W'\|_\infty \bigl(1 - \pi(\Omega_0)^2\bigr)^{1/p}.$$

But because the classes $Y_{\ell+1}, \ldots, Y_N$ have measure smaller than $\kappa$,

$$\pi(\Omega_0) \ge 1 - \ell\kappa \ge 1 - N\kappa,$$

showing that

$$\|W' - W''\|_p \le \|W'\|_\infty (2N\kappa)^{1/p},$$

which is bounded by $\varepsilon$ if $\kappa \le \frac{1}{2N}\bigl(\varepsilon/\|W'\|_\infty\bigr)^p$. □


In many applications, the underlying measure on the latent position space $\Omega$ does not have
compact support. Gaussians are a noteworthy case, as are distributions with heavier tails (such
as Student distributions). Another reason to consider measures without compact support comes
from the desire to model graphs with power-law degree distributions. As discussed already in
Section 1.2, bounded graphons do not allow for power-law degree distributions, showing in particular
that Hölder-continuous graphons over $\mathbb{R}^d$ equipped with a measure with compact support do not
lead to graphs that exhibit power-law degree distributions.^10 For all these reasons, we aim for a
generalization of Proposition 6.1 to measures whose supports are not necessarily compact.

^10 Once the assumption of compact support is removed, this reasoning no longer applies, and as shown in Section 7,
there are indeed Hölder-continuous graphons over $\mathbb{R}^d$ which generate graphs with power-law degree distributions.
Since we want graphons to be integrable (in fact, for the least squares algorithm to be consistent,
we need them to be square integrable) we will restrict ourselves to probability distributions $\pi$ over
$\mathbb{R}^d$ in

$$\mathcal{M}_\beta = \Bigl\{\pi : \int |x|_\infty^\beta \, d\pi(x) < \infty\Bigr\},$$

where $\beta > 0$ is a parameter which we will choose to be at least $\alpha$ (or at least $2\alpha$ when we want to
guarantee that the graphons in $\mathcal{H}_{C,\alpha}$ are in $L^2$).

Proposition 6.3. Let $d \ge 1$ and $\beta > \alpha > 0$, let $\pi \in \mathcal{M}_\beta$, and let $W$ be an $\alpha$-Hölder-continuous
graphon over $\mathbb{R}^d$ equipped with the probability distribution $\pi$, normalized in such a way that $\|W\|_1 = 1$.
If $1 \le p < \beta/\alpha$ and $\kappa \le 1/2$, then

$$\varepsilon^{(p)}_{\ge\kappa}(W) = O(\kappa^{\alpha'}) \quad\text{and}\quad \mathrm{tail}^{(p)}_\rho(W) = O(\rho^{\beta'}),$$

where $\beta' = \frac{\beta}{p\alpha} - 1$ and $\alpha' = \frac{\alpha\beta'}{(1 + \beta')(p\alpha + d)}$, and the constants implicit in the big-$O$ symbols depend on
the distribution $\pi$ and the constants $\alpha$, $\beta$, $p$, and $C$.

Proof. Let $R_0 \ge 1$ be such that $\pi(\Lambda_{R_0}) \ge 1/2$, and let $D_0 = 4 + 2CR_0^\alpha$. Then

$$\min_{x, y \in \Lambda_{R_0}} W(x, y) \le \frac{\|W\|_1}{\pi(\Lambda_{R_0})^2} \le 4.$$

Denoting the minimizer of $W$ in $\Lambda_{R_0} \times \Lambda_{R_0}$ by $(x_0, y_0)$, we then have $W(0,0) \le 4 + C|x_0|_\infty^\alpha + C|y_0|_\infty^\alpha \le
4 + 2CR_0^\alpha$, implying that

$$W(x, y) \le D_0 + C|x|_\infty^\alpha + C|y|_\infty^\alpha$$

for all $x, y \in \mathbb{R}^d$. It will be convenient to introduce the functions $f(x, y) = C|x|_\infty^\alpha$ and $g(x, y) = C|y|_\infty^\alpha$
and write this inequality as

$$W \le D_0 + f + g.$$

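By our definition of $\beta'$ and our assumption on $\pi$,

$$\|f\|_{p(1+\beta')} = C\Bigl(\int |x|_\infty^\beta \, d\pi(x)\Bigr)^{\alpha/\beta} < \infty.$$

To prove the bound on $\mathrm{tail}^{(p)}_\rho(W)$, we observe that $0 \le W - W_\rho \le W 1_{W \ge 1/\rho}$. As a consequence,

$$\begin{aligned}
\mathrm{tail}^{(p)}_\rho(W) &\le \|W 1_{W \ge 1/\rho}\|_p \le \rho^{\beta'}\|W^{1+\beta'}\|_p \\
&\le \rho^{\beta'}\bigl(\|D_0 + f + g\|_{p(1+\beta')}\bigr)^{1+\beta'} \\
&\le \rho^{\beta'}\bigl(D_0 + \|f\|_{p(1+\beta')} + \|g\|_{p(1+\beta')}\bigr)^{1+\beta'} = D\rho^{\beta'}
\end{aligned}$$

for some constant $D$ depending on $\alpha$, $\beta$, $p$, and $C$, as well as the measure $\pi$ (via $R_0$ and the norm
$\|f\|_{p(1+\beta')}$).

To prove the bound on the oracle error, we want to construct a good block model approximation
to $W$. To this end, we first bound the contributions to $\|W\|_p$ that come from points $x, y$ outside a
box $\Lambda_R$, where $R \ge 1$ will be chosen later. If we set $r = CR^\alpha$, then the condition $(x, y) \notin \Lambda_R \times \Lambda_R$
implies $|x|_\infty > R$ or $|y|_\infty > R$ and hence $f + g > r$. But

$$\|W 1_{f+g>r}\|_p \le r^{-\beta'}\|(f + g)^{\beta'}W\|_p \le r^{-\beta'}\|(D_0 + f + g)^{1+\beta'}\|_p \le DC^{-\beta'}R^{-\alpha\beta'}$$

and hence

(6.2) $$\|W 1_{(\mathbb{R}^d \times \mathbb{R}^d)\setminus(\Lambda_R \times \Lambda_R)}\|_p \le D_1 R^{-\alpha\beta'},$$

as long as $D_1$ is chosen so that $D_1 \ge DC^{-\beta'}$.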
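Next we consider a partition $\mathcal{P} = (Y_1, \ldots, Y_N)$ of $\Lambda_R$ into cubes of side length $2R/k$, with $N = k^d$.
We define $\beta_{ij}$ to be the average of $W$ over $Y_i \times Y_j$, and

$$W' = \sum_{i,j=1}^N \beta_{ij} 1_{Y_i \times Y_j}.$$

Since $W'$ is composed of parts obtained by averaging over subsets in $\Lambda_R$, where $W$ is bounded by
$D_0 + 2CR^\alpha \le D_0(1 + R^\alpha) \le 2D_0R^\alpha$, we have

$$\|W'\|_\infty \le 2D_1R^\alpha,$$

provided $D_1$ is chosen to be at least $D_0$.

Inside $\Lambda_R \times \Lambda_R$, we bound $|W(x, y) - W'(x, y)|$ by

$$2C(2R/k)^\alpha = 2C2^\alpha R^\alpha N^{-\alpha/d} \le D_1R^\alpha N^{-\alpha/d},$$

where $D_1 = \max\{D_0, DC^{-\beta'}, 2C2^\alpha\}$. Finally $W - W' = W$ outside $\Lambda_R \times \Lambda_R$. Combined with the bound
(6.2), we conclude that

$$\|W - W'\|_p \le \varepsilon,$$

where $\varepsilon = D_1(R^\alpha N^{-\alpha/d} + R^{-\beta'\alpha})$. With the help of Lemma 6.2 we conclude that

$$\varepsilon^{(p)}_{\ge\kappa}(W) \le 2\varepsilon$$

provided that

$$\kappa \le \frac{1}{2N}\Bigl(\frac{N^{-\alpha/d} + R^{-(\beta'+1)\alpha}}{2}\Bigr)^p$$

and $R \ge 1$. Choosing $R = N^{\frac{1}{d(1+\beta')}}$, we find that

$$\varepsilon^{(p)}_{\ge\kappa}(W) \le 4D_1N^{-\frac{\beta'\alpha}{d(1+\beta')}},$$

provided that $\kappa \le \frac{1}{2}N^{-\frac{p\alpha + d}{d}}$.
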


Because k < 1/2, we can choose k = 


1 



1 


pa + d 


. Then N = k^ implies 


pa.-\-d 


< N < 


1 

2k 


d 


pcx-\-d 


This yields a bound of 

£>liW) < ADi , 

which is $O(\kappa^{\alpha'})$. Again the implicit constant depends only on $\alpha$, $\beta$, $p$, $C$, and $\pi$. □


7. Power-law graphs 

Recall that the normalized degree distribution of a graph $G$ on $[n]$ is defined as the empirical
distribution of the normalized degrees $d_i/\bar{d}$, where $\bar{d}$ is the average degree. We say that a sequence
$(G_n)_{n \ge 0}$ has convergent degree sequences if the cumulative distribution functions $D_{G_n}$ of the
normalized degrees converge to some distribution function $D$ (that is, a non-decreasing, right-continuous
function $D\colon \mathbb{R} \to [0,1]$ such that $\lim_{\lambda \to -\infty} D(\lambda) = 0$ and $\lim_{\lambda \to \infty} D(\lambda) = 1$) in the
Lévy-Prokhorov distance $d_{\mathrm{LP}}$ or, equivalently, if $D_{G_n}(\lambda) \to D(\lambda)$ for all $\lambda$ at which $D$ is continuous.

We say that the sequence $(G_n)_{n \ge 0}$ has a power-law degree distribution with exponent $\gamma$ if its
degree distributions converge to $D$ satisfying

\[ D(\lambda) = 1 - \Theta(\lambda^{-(\gamma-1)}) \quad\text{as } \lambda \to \infty, \]

and we say that a graphon $W$ has a power-law degree distribution with exponent $\gamma$ if $D_W(\lambda) =
1 - \Theta(\lambda^{-(\gamma-1)})$ as $\lambda \to \infty$.

Note that it is $\gamma - 1$ that appears in the exponent, not $\gamma$. The naming conventions in the
above definitions are based on density functions, rather than distribution functions: if the degree
distribution is absolutely continuous with respect to Lebesgue measure and thus has a density
function $f(\lambda)$, and if $f(\lambda) = \Theta(\lambda^{-\gamma})$ as $\lambda \to \infty$, then the distribution function $D$ satisfies

\[ 1 - D(\lambda) = \int_\lambda^\infty f(\lambda') \, d\lambda' = \Theta(\lambda^{-(\gamma-1)}). \]

In this section, we give two examples of $W$-random graphs with power-law degree distributions
and establish bounds on the convergence rate of our estimation procedures for these graphons.

We start with an example that can be expressed as a Hölder-continuous graphon over $\mathbb{R}^d$, even
though we will first define it as a graphon over $[0,1]$. It is the graphon

\[ (7.1) \qquad W(x,y) = \tfrac{1}{2}\bigl(g(x) + g(y)\bigr) \quad\text{where } g(x) = (1-\alpha)(1-x)^{-\alpha}, \]

for some $\alpha \in (0,1)$. Note that the degrees of this graphon are $W_x = \frac{1}{2} + \frac{1}{2}g(x)$, with a distribution
function $D_W(\lambda)$ that goes to 1 like $1 - \Theta(\lambda^{-1/\alpha})$ as $\lambda \to \infty$, showing that the graphs $G_n(\rho_n W)$
have a power-law degree distribution with exponent $\gamma = 1 + \frac{1}{\alpha}$.
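
As a quick sanity check of the exponent, the degree function and its tail for the graphon (7.1) can be written out explicitly. The following is only an illustrative sketch: the function names (`degree`, `degree_tail`, `local_exponent`) are ours, not notation from the paper, and the closed form for the tail comes from solving $W_x > \lambda$ for $x$, using $\int_0^1 g = 1$.

```python
import math

def g(x, alpha):
    """g(x) = (1 - alpha) * (1 - x)^(-alpha), as in (7.1)."""
    return (1.0 - alpha) * (1.0 - x) ** (-alpha)

def degree(x, alpha):
    """Degree W_x = int_0^1 W(x, y) dy = 1/2 + g(x)/2, since int_0^1 g = 1."""
    return 0.5 + 0.5 * g(x, alpha)

def degree_tail(lam, alpha):
    """1 - D_W(lam): Lebesgue measure of {x in [0, 1) : W_x > lam}."""
    if lam <= degree(0.0, alpha):
        return 1.0  # every x > 0 has degree above lam
    # W_x > lam  <=>  1 - x < ((1 - alpha) / (2 * lam - 1))^(1 / alpha)
    return ((1.0 - alpha) / (2.0 * lam - 1.0)) ** (1.0 / alpha)

def local_exponent(lam, alpha):
    """Slope of log(1 - D_W) against log(lam), measured between lam and 2 * lam."""
    return math.log(degree_tail(2.0 * lam, alpha) / degree_tail(lam, alpha)) / math.log(2.0)
```

For large `lam`, `local_exponent(lam, alpha)` approaches $-1/\alpha = -(\gamma - 1)$, in line with the exponent $\gamma = 1 + 1/\alpha$.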

As a graphon over $[0,1]$ equipped with the uniform measure, this graphon is not continuous, but
it turns out that it can be expressed as an equivalent graphon over $\mathbb{R}^d$ that is Hölder-continuous.
To see this, let us consider a probability distribution $\pi$ on $\mathbb{R}^d$ such that the distribution of the
norm $r = |x|_2$ of $x \in \mathbb{R}^d$ is absolutely continuous with respect to the Lebesgue measure on $[0,\infty)$,
with a strictly positive density function $f(r)$. We will want to construct a measure-preserving map
$\phi\colon \mathbb{R}^d \to [0,1)$ to obtain an equivalent graphon over $\mathbb{R}^d$. Requiring $\phi$ to be measure preserving
is equivalent to requiring that $\pi(\phi^{-1}([0,a])) = \pi(\{x : \phi(x) \le a\}) = a$. We will construct $\phi$ radially,
via a map $F$ such that $\phi(x) = F(|x|_2)$, and we will make sure that $F$ is strictly increasing, in which
case $\phi(x) \le a$ is equivalent to $|x|_2 \le F^{-1}(a)$. Thus, our condition for $\phi$ to be measure preserving
becomes $a = \int 1_{|x|_2 \le F^{-1}(a)} \, d\pi(x)$, or equivalently, $\int 1_{|x|_2 \le r} \, d\pi(x) = F(r)$, showing that $F(r)$ is the
cumulative distribution function of $|x|_2$ (which is strictly monotone by our assumption that $f(r) > 0$
for all $r \in [0,\infty)$). Taking $F(r) = 1 - \frac{1}{1+r}$, we get

\[
W^\phi(x,y) = \frac{1-\alpha}{2}\Bigl(\frac{1}{1 - F(|x|_2)}\Bigr)^{\alpha} + \frac{1-\alpha}{2}\Bigl(\frac{1}{1 - F(|y|_2)}\Bigr)^{\alpha}
= \frac{1-\alpha}{2}\bigl((1 + |x|_2)^\alpha + (1 + |y|_2)^\alpha\bigr),
\]

showing that $W$ is equivalent to an $\alpha$-Hölder-continuous graphon over $\mathbb{R}^d$ equipped with any measure
for which the cumulative distribution function of $|x|_2$ is equal to $F$. As a consequence, we may use
the results of Section 6 to give explicit bounds on the estimation errors for the least squares and
least cut algorithms. We will not give these bounds here, since for $W$ of the form (7.1), one can
obtain slightly better bounds using the actual form of $W$; see Lemma 7.1 below.
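
The change of variables above is easy to check numerically. The sketch below assumes the choice $F(r) = 1 - 1/(1+r)$ and works directly with the norms $r_x = |x|_2$ and $r_y = |y|_2$; the helper names (`W_unit`, `W_radial`, `F_inverse`) are ours and purely illustrative.

```python
def F(r):
    """CDF of the norm |x|_2 assumed in this sketch: F(r) = 1 - 1/(1 + r)."""
    return 1.0 - 1.0 / (1.0 + r)

def F_inverse(a):
    """Inverse of F on [0, 1): the radius r with F(r) = a."""
    return a / (1.0 - a)

def g(u, alpha):
    """g(u) = (1 - alpha) * (1 - u)^(-alpha), as in (7.1)."""
    return (1.0 - alpha) * (1.0 - u) ** (-alpha)

def W_unit(u, v, alpha):
    """Graphon (7.1) on [0, 1)."""
    return 0.5 * (g(u, alpha) + g(v, alpha))

def W_radial(rx, ry, alpha):
    """Pulled-back graphon, written in terms of rx = |x|_2 and ry = |y|_2."""
    return 0.5 * (1.0 - alpha) * ((1.0 + rx) ** alpha + (1.0 + ry) ** alpha)
```

Evaluating `W_unit(F(rx), F(ry), alpha)` and `W_radial(rx, ry, alpha)` at the same radii gives the same value, and the map $r \mapsto (1+r)^\alpha$ satisfies $|(1+r)^\alpha - (1+s)^\alpha| \le |r-s|^\alpha$ for $\alpha \in (0,1)$, which is the source of the Hölder continuity.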

The second example we consider in this section is the graphon $W$ over $[0,1]$ that is defined by

\[ (7.2) \qquad W(x,y) = g(x)g(y) \quad\text{where again } g(x) = (1-\alpha)(1-x)^{-\alpha}. \]

As before, we equip $[0,1]$ with the uniform measure. Now the degrees of $W$ are equal to $g(x)$, which
shows that again, the $W$-random graphs obtained from $W$ have power-law degrees with exponent
$\gamma = 1 + \frac{1}{\alpha}$.

Note that the second graphon cannot be expressed as a Hölder-continuous graphon over $\mathbb{R}^d$ in
the sense of Section 6. Indeed, suppose $\widetilde{W}$ were such a graphon. By Theorem 2.9, there would
exist a standard Borel twin-free graphon $U$ such that $\widetilde{W} = U^\phi$ for some measure-preserving map $\phi$
from $\mathbb{R}^d$ to the space on which $U$ is defined. Since $W$ is twin-free as well we may without loss of
generality assume that $U = W$ (use Theorem 2.8). But this means that $\widetilde{W}$ would be of the form
$\widetilde{W}(x,y) = W(\phi(x),\phi(y)) = g(\phi(x))g(\phi(y))$ for some measure-preserving map $\phi\colon \mathbb{R}^d \to [0,1]$. Since
$g(\phi(x))$ is unbounded, this cannot be a Hölder-continuous function of the argument $y$.

Nevertheless, we can give explicit bounds on our estimation error since for $W$ of the form (7.1)
or (7.2), we can estimate $\varepsilon^{(p)}_{\ge\kappa}(W)$ and $\mathrm{tail}^{(p)}_\rho(W)$ directly.

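Lemma 7.1. Let $\alpha \in (0,1)$, let $1 \le p < 1/\alpha$, and define $\alpha' = \frac{1}{p} - \alpha$ and $\beta' = \frac{1}{p\alpha} - 1$. If $W$ is the
power-law graphon (7.1), then

\[ \varepsilon^{(p)}_{\ge\kappa}(W) = O(\kappa^{\alpha'}) \quad\text{and}\quad \mathrm{tail}^{(p)}_\rho(W) = O(\rho^{\beta'}), \]

while if $W$ is the power-law graphon (7.2), then

\[ \varepsilon^{(p)}_{\ge\kappa}(W) = O(\kappa^{\alpha'}) \quad\text{and}\quad \mathrm{tail}^{(p)}_\rho(W) = O\bigl(\rho^{\beta'} |\log\rho|^{1/p}\bigr). \]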

Proof. We start with the proof of the tail bounds. Defining $g_1, g_2\colon [0,1]^2 \to [0,\infty)$ by $g_1(x,y) = g(x)$
and $g_2(x,y) = g(y)$, we write the first graphon as $\frac{1}{2}(g_1 + g_2)$. Noting that $W \ge \rho^{-1}$ implies that
either $g_1 \ge 1/\rho$ or $g_2 \ge 1/\rho$, we bound (using the symmetry between $g_1$ and $g_2$ in the last step)

\[
\|W - W_\rho\|_p \le \|W 1_{W \ge 1/\rho}\|_p \le \|W(1_{\rho g_1 \ge 1} + 1_{\rho g_2 \ge 1})\|_p
\le \|g_1 1_{\rho g_1 \ge 1}\|_p + \|g_2 1_{\rho g_1 \ge 1}\|_p.
\]

The two terms can easily be calculated explicitly, giving a term of order $O(\rho^{\frac{1-p\alpha}{p\alpha}})$ for the first
and a term of order $O(\rho^{\frac{1}{p\alpha}})$ for the second. For the second graphon, we note that the condition
$W(x,y) \ge 1/\rho$ is equivalent to $(1-x)(1-y) \le (\rho(1-\alpha)^2)^{1/\alpha}$. Changing to the variables $1-x$
and $1-y$, we have to estimate the integral

\[ \int_0^1 \int_0^1 (xy)^{-p\alpha} 1_{xy \le c} \, dx \, dy, \qquad c = (\rho(1-\alpha)^2)^{1/\alpha}. \]

The integral can again be calculated explicitly, giving an error term of order $O(\rho^{\frac{1-p\alpha}{\alpha}} |\log\rho|)$. Taking
the $p$th root, we obtain the claimed tail bound for the second graphon.
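
For readers who want to see the explicit calculation, the integral $I(c) = \int_0^1\int_0^1 (xy)^{-q} 1_{xy \le c}\,dx\,dy$ with $q = p\alpha < 1$ and $0 < c < 1$ splits at $x = c$ and evaluates to $c^{1-q}\bigl(\frac{\log(1/c)}{1-q} + \frac{1}{(1-q)^2}\bigr)$. The sketch below compares this closed form (our own derivation, not quoted from the paper) with a simple midpoint rule; the substitution $u = x^{1-q}$ is an implementation choice that removes the singularity at $x = 0$.

```python
import math

def tail_integral_closed_form(c, q):
    """Closed form of I(c) for 0 < c < 1 and 0 < q < 1."""
    return c ** (1 - q) * (math.log(1 / c) / (1 - q) + 1 / (1 - q) ** 2)

def tail_integral_numeric(c, q, steps=100_000):
    """Midpoint rule for I(c) after integrating out y exactly.

    The inner integral over y is min(1, c/x)^(1-q) / (1-q). Substituting
    u = x^(1-q) turns x^(-q) dx into du / (1-q), so the remaining
    integrand in u is bounded on [0, 1].
    """
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) / steps
        x = u ** (1 / (1 - q))
        total += min(1.0, c / x) ** (1 - q) / (1 - q)
    return total / (steps * (1 - q))
```

With $c = (\rho(1-\alpha)^2)^{1/\alpha}$ and $q = p\alpha$, the closed form is of order $\rho^{(1-p\alpha)/\alpha}|\log\rho|$ as $\rho \to 0$, which is the error term quoted in the proof.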

All that remains is to estimate the oracle errors. Let $I_1, \dots, I_k$ be a partition of $[0,1]$ into $k$
adjacent intervals of size $\varepsilon = \frac{1}{k}$ (ordered from left to right), let $g'$ be the function obtained by
averaging $g$ over these intervals on $I_1 \cup I_2 \cup \dots \cup I_{k_0}$ (where $k_0$ will be determined later), and let $g' = 0$
on the remaining intervals. Define $g_1, g_2\colon [0,1]^2 \to [0,\infty)$ as above, define $g_1'$ and $g_2'$ analogously,
and set $W' = \frac{1}{2}(g_1' + g_2')$ for the graphon (7.1) and $W' = g_1' g_2'$ for the graphon (7.2). With this
notation,

\[ \|W - W'\|_p = \tfrac{1}{2}\|g_1 + g_2 - g_1' - g_2'\|_p \le \|g - g'\|_p \]

for the graphon (7.1), and

\[ \|W - W'\|_p = \|g_1 g_2 - g_1' g_2'\|_p \le \|(g_1 - g_1') g_2\|_p + \|g_1'(g_2 - g_2')\|_p \le 2\|g\|_p \|g - g'\|_p \]

for the graphon (7.2). So all we need to show is that $\|g - g'\|_p = O(\varepsilon^{\alpha'})$.

For $i \le k_0$, let $x_i \in I_i$ be defined by $\frac{1}{\varepsilon}\int_{I_i} g = g(x_i)$. For $x \in I_i$, we bound $|g(x) - g(x_i)| \le
\max_{y \in I_i} \bigl|\frac{dg}{dy}(y)\bigr| \, |x - x_i|$, implying that the integral of $|g(x) - g(x_i)|^p$ over $I_i$ can be bounded by
$\varepsilon^{p+1} \max_{y \in I_i} \bigl|\frac{dg}{dy}(y)\bigr|^p \le \varepsilon^{p+1}(1 - i\varepsilon)^{-p(1+\alpha)}$. Summing over $i = 1, \dots, k_0$, we get a contribution of
$O(\varepsilon^p (1 - k_0\varepsilon)^{1-p(1+\alpha)})$ to $\|g - g'\|_p^p$. The integral of $g^p$ from $k_0\varepsilon$ to 1 will contribute $O((1 - k_0\varepsilon)^{1-p\alpha})$.
As a consequence, the choice $k_0 = k - 1$ (which gives $1 - k_0\varepsilon = \varepsilon$) leads to the estimate

\[ \|g - g'\|_p^p = O(\varepsilon^{1-\alpha p}), \]

as desired. □
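
The scaling $\|g - g'\|_p^p = O(\varepsilon^{1-\alpha p})$ can be observed numerically. The following sketch uses $k_0 = k - 1$ as in the proof; the function name `block_error_pth_power`, the midpoint quadrature, and the parameter `sub` are our own illustrative choices, not part of the paper.

```python
def g(x, alpha):
    """g(x) = (1 - alpha) * (1 - x)^(-alpha)."""
    return (1.0 - alpha) * (1.0 - x) ** (-alpha)

def block_error_pth_power(k, alpha, p, sub=200):
    """Approximate ||g - g'||_p^p for k intervals of length eps = 1/k.

    g' is the interval average of g on the first k - 1 intervals and 0 on
    the last one. Requires p * alpha < 1 so that g is in L^p.
    """
    eps = 1.0 / k
    total = 0.0
    for i in range(k - 1):
        a, b = i * eps, (i + 1) * eps
        # Exact average of g on [a, b]: an antiderivative of g is -(1 - x)^(1 - alpha).
        mean = ((1 - a) ** (1 - alpha) - (1 - b) ** (1 - alpha)) / eps
        h = eps / sub
        # Midpoint rule for the integral of |g - mean|^p over [a, b].
        total += h * sum(abs(g(a + (j + 0.5) * h, alpha) - mean) ** p for j in range(sub))
    # Last interval: g' = 0, and the integral of g^p over [1 - eps, 1] is explicit.
    total += (1 - alpha) ** p * eps ** (1 - alpha * p) / (1 - alpha * p)
    return total
```

For example, with $\alpha = 0.3$ and $p = 2$ the ratio `block_error_pth_power(k, 0.3, 2) / (1.0 / k) ** 0.4` stays between roughly 1.22 and 1.29 as `k` grows, consistent with the exponent $1 - \alpha p = 0.4$.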
Appendix A. Couplings, metrics, and equivalence

We start this appendix by reformulating Remark 2.1 in the more general setting of Borel spaces.

Lemma A.1. Let $p \ge 1$ and let $W$ and $W'$ be $L^p$ graphons over two Borel spaces $(\Omega, \mathcal{F}, \pi)$ and
$(\Omega', \mathcal{F}', \pi')$. Then the following hold:

(i) The infima in (1.1) and (2.3) are attained for some couplings $\nu$.

(ii) If $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ are atomless, then the distances $\delta_p(W,W')$ and $\delta_\Box(W,W')$ can be
expressed as

\[ \delta_p(W,W') = \inf_\phi \|W - (W')^\phi\|_p = \inf_\Phi \|W - (W')^\Phi\|_p \]

and

\[ \delta_\Box(W,W') = \inf_\phi \|W - (W')^\phi\|_\Box = \inf_\Phi \|W - (W')^\Phi\|_\Box, \]

where the infima over $\phi$ are over measure-preserving maps from $\Omega$ to $\Omega'$ and the infima over $\Phi$ are
over isomorphisms from $\Omega$ to $\Omega'$.

For the cut metric, the first statement is a special case of Theorem 6.16 in [44] (see also Lemma
2.6 in [12], which proves the statement for bounded graphons over $[0,1]$), while the second is
essentially^^ given in Lemma 3.5 in [18]. The proofs for the distance $\delta_p$ are virtually identical. For
the convenience of the reader, we sketch them below.

Note that the first statement does not hold without the assumption that $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$
are Borel spaces; see, for example, Example 8.13 in [44] for a counterexample. Similarly, the
assumption that $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ are atomless is needed for the second statement to hold;
see Remark 6.10 in [44]. (Indeed, the condition involving $\Phi$ does not even make sense unless $\Omega$
and $\Omega'$ are isomorphic, but all atomless Borel spaces are isomorphic by Theorem A.7 in [44]. For
arbitrary probability spaces there may not even be any measure-preserving maps from $\Omega$ to $\Omega'$.)


Proof. We begin with part (i). For the cut metric, this is a special case of Theorem 6.16 in [44].
The proof for the metric $\delta_p$ is very similar. For the convenience of the reader, we give the proof
below, combining proof techniques from [44] and [12].

Let $\mathcal{M}$ be the set of all probability measures on $\Omega \times \Omega'$ for which the marginals are $\pi$ and $\pi'$. We
first observe that $\mathcal{M}$ is compact in the weak* topology. To see why, first note that by Theorem
A.4(iv) in [44], the measurable spaces $(\Omega, \mathcal{F})$ and $(\Omega', \mathcal{F}')$ are either countable (with all subsets
measurable) or isomorphic to $[0,1]$ with the Borel $\sigma$-algebra. Let $\mathcal{A}_0$ be the set of all $A \subseteq \Omega \times \Omega'$ that
are products of intervals with rational endpoints in the $[0,1]$ case and finite sets in the countable
case. Since $\mathcal{A}_0$ is countable, any sequence of measures $\nu_n \in \mathcal{M}$ has a subsequence such that
$\nu_n(A)$ converges for all $A \in \mathcal{A}_0$. Since $\mathcal{A}_0$ generates the product $\sigma$-algebra on $\Omega \times \Omega'$, the limit can
be extended to a probability measure $\nu$ on $\Omega \times \Omega'$, which can easily be checked to have $\pi$ and $\pi'$ as
marginals, implying that $\nu \in \mathcal{M}$.

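Consider a sequence of couplings $\nu_n$ such that

\[ (A.1) \qquad \delta_p(W,W') = \lim_{n\to\infty} \Bigl(\int |W(x,y) - W'(x',y')|^p \, d\nu_n(x,x') \, d\nu_n(y,y')\Bigr)^{1/p}. \]

By the compactness of $\mathcal{M}$, we may pass to a subsequence (which we again denote by $\nu_n$) for which
there is a limit $\nu \in \mathcal{M}$ such that $\nu_n(A) \to \nu(A)$ for all $A \in \mathcal{A}_0$. Since $\nu \in \mathcal{M}$,

\[ \delta_p(W,W') \le \Bigl(\int |W(x,y) - W'(x',y')|^p \, d\nu(x,x') \, d\nu(y,y')\Bigr)^{1/p}. \]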


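To prove a matching lower bound we fix $\varepsilon > 0$ to be sent to zero later. By (A.1), we can find an
$n_0$ such that

\[ \delta_p(W,W') \ge \Bigl(\int |W(x,y) - W'(x',y')|^p \, d\nu_n(x,x') \, d\nu_n(y,y')\Bigr)^{1/p} - \varepsilon \]

for all $n \ge n_0$. Since $W \in L^p$, we can find an $M$ such that $\|W 1_{|W| > M}\|_p \le \varepsilon$, and since $W 1_{|W| \le M}$
is bounded, we can find a graphon $\widetilde{W}$ which is a finite sum of the form $\widetilde{W} = \sum_{i,j} \beta_{ij} 1_{A_i \times A_j}$ with
$A_i \in \mathcal{A}_0$ such that $\|W 1_{|W| \le M} - \widetilde{W}\|_p \le \varepsilon$, implying in particular $\|W - \widetilde{W}\|_p \le 2\varepsilon$. In a similar
way, we can find $\widetilde{W}'$ of the form $\widetilde{W}' = \sum_{k,\ell} \beta'_{k\ell} 1_{B_k \times B_\ell}$ with $B_k \in \mathcal{A}_0$ and $\|W' - \widetilde{W}'\|_p \le 2\varepsilon$. As a
consequence

\[
\begin{aligned}
\delta_p(W,W') &\ge \Bigl(\int |\widetilde{W}(x,y) - \widetilde{W}'(x',y')|^p \, d\nu_n(x,x') \, d\nu_n(y,y')\Bigr)^{1/p} - 5\varepsilon \\
&= \Bigl(\sum_{i,j,k,\ell} |\beta_{ij} - \beta'_{k\ell}|^p \, \nu_n(A_i \times B_k) \, \nu_n(A_j \times B_\ell)\Bigr)^{1/p} - 5\varepsilon
\end{aligned}
\]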


^^While Lemma 3.5 in [18] was only stated for bounded graphons over [0,1], the generalization to unbounded 
graphons over an atomless Borel space is straightforward. 
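for all $n \ge n_0$. We can take the limit as $n \to \infty$ on the right side, to obtain the bound

\[
\begin{aligned}
\delta_p(W,W') &\ge \Bigl(\sum_{i,j,k,\ell} |\beta_{ij} - \beta'_{k\ell}|^p \, \nu(A_i \times B_k) \, \nu(A_j \times B_\ell)\Bigr)^{1/p} - 5\varepsilon \\
&= \Bigl(\int |\widetilde{W}(x,y) - \widetilde{W}'(x',y')|^p \, d\nu(x,x') \, d\nu(y,y')\Bigr)^{1/p} - 5\varepsilon \\
&\ge \Bigl(\int |W(x,y) - W'(x',y')|^p \, d\nu(x,x') \, d\nu(y,y')\Bigr)^{1/p} - 9\varepsilon.
\end{aligned}
\]

Since $\varepsilon$ was arbitrary, this proves part (i) of the lemma.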

We now turn to part (ii). All atomless Borel spaces are isomorphic to $[0,1]$ (with the Borel
$\sigma$-algebra and uniform distribution), by Theorem A.7 in [44]. Thus, we can assume without loss of
generality that $\Omega$ and $\Omega'$ are both $[0,1]$.

Choosing $z$ uniformly at random from $[0,1]$, the map $z \mapsto (z, \phi(z))$ provides a coupling showing
that $\delta_p(W,W') \le \inf_\phi \|W - (W')^\phi\|_p$ and $\delta_\Box(W,W') \le \inf_\phi \|W - (W')^\phi\|_\Box$. It is also obvious that
$\inf_\phi \|W - (W')^\phi\|_p \le \inf_\Phi \|W - (W')^\Phi\|_p$ and $\inf_\phi \|W - (W')^\phi\|_\Box \le \inf_\Phi \|W - (W')^\Phi\|_\Box$.

To prove equality, one first approximates $W$ and $W'$ by piecewise constant functions (more
precisely, graphons on $[n]$ equipped with the uniform measure), and then approximates an arbitrary
coupling of two uniform measures on $[n]$ by a bijection on a "blow-up" $[nk]$ of $[n]$. Mapping this
bijection back to an isomorphism $\Phi\colon [0,1] \to [0,1]$ then gives a lower bound on $\delta_p(W,W')$ in terms of
$\inf_\Phi \|W^\Phi - W'\|_p$, minus some error which can be taken to be arbitrarily small. The details are very
similar to the proof of Lemma 3.5 in [18], which proves equality for the cut norm when $W$ and $W'$
are bounded, and we leave them to the reader. Note that the generalization to unbounded graphons
is straightforward, given that $\|W 1_{|W| > M}\|_p \to 0$ as $M \to \infty$ and $\|W 1_{|W| > M}\|_\Box \le \|W 1_{|W| > M}\|_1$. □


In the remainder of this appendix, we prove most of the theorems from Section 2.4. We rely
heavily on both the results and the techniques of [14] and [44]; see also [12]. Before turning to these
proofs, we relate the notion of equivalence from Definition 2.5 to the notion of "weak isomorphism"
from [14], which requires the maps $\phi$ and $\phi'$ to be measure preserving with respect to the completion
of the spaces $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$. It is clear that equivalence implies weak isomorphism, since
maps that are measurable with respect to $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ are clearly measurable with
respect to their completions. We can also turn this around, at least when the third space is a
Lebesgue space, i.e., the completion of a Borel space. This follows from part (i) of the following
technical lemma.


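Lemma A.2. Let $W$ and $W'$ be graphons over two probability spaces $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$,
respectively.

(i) Assume that there exist measure-preserving maps $\phi$ and $\phi'$ from the completions of $(\Omega, \mathcal{F}, \pi)$
and $(\Omega', \mathcal{F}', \pi')$ to a Lebesgue space $(\Omega'', \mathcal{F}'', \pi'')$ and a graphon $U$ over $(\Omega'', \mathcal{F}'', \pi'')$ such that
$W = U^\phi$ and $W' = U^{\phi'}$ almost everywhere. Then there exists a standard Borel graphon $\widetilde{U}$ and
measure-preserving maps $\widetilde\phi$ and $\widetilde\phi'$ from $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ to the Borel space on which $\widetilde{U}$
is defined such that $W = \widetilde{U}^{\widetilde\phi}$ and $W' = \widetilde{U}^{\widetilde\phi'}$ almost everywhere. If $U$ is twin-free, then $\widetilde{U}$ can be
chosen to be twin-free as well.

(ii) If $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ are Borel spaces and $W$ and $W'$ are isomorphic modulo 0 when
considered as graphons over the completions of $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$, then they are also isomorphic
modulo 0 as graphons over $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$.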

Proof. (i) Since every Lebesgue space is isomorphic modulo 0 to the union of an interval $[0,p]$ and a
collection of atoms $x_i$ (see Theorem A.10 in [44]), we may without loss of generality assume that
$(\Omega'', \mathcal{F}'', \pi'')$ is of this form. Assume without loss of generality that the atoms are represented as
points $x_i \in (p, 1]$, so that $\phi$ takes values in $[0,1]$. Noting that $\mathcal{F}''$ is the completion of a Borel
$\sigma$-algebra $\mathcal{B}''$, define $\widetilde{U}$ as the conditional expectation $\mathbb{E}[U \mid \mathcal{B}'' \times \mathcal{B}'']$. Then $\widetilde{U}$ is a Borel graphon
such that $\widetilde{U} = U$ almost everywhere. Since $\phi$ is measure preserving from the completion $(\Omega, \overline{\mathcal{F}}, \pi)$ of
$(\Omega, \mathcal{F}, \pi)$ to $(\Omega'', \mathcal{F}'', \pi'')$, it is also measure preserving from $(\Omega, \overline{\mathcal{F}}, \pi)$ to $(\Omega'', \mathcal{B}'', \pi'')$. Replacing $\phi$
by the conditional expectation $\widetilde\phi = \mathbb{E}[\phi \mid \mathcal{F}]$, we obtain a measure-preserving map $\widetilde\phi$ from $(\Omega, \mathcal{F}, \pi)$
to $(\Omega'', \mathcal{B}'', \pi'')$ such that $W = \widetilde{U}^{\widetilde\phi}$ almost everywhere. If $U$ is twin-free, then so is $\widetilde{U}$.

(ii) The completions of $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ are Lebesgue spaces. Since every Lebesgue
space is isomorphic modulo 0 to the disjoint union of an interval $[0,p]$ (equipped with the Lebesgue
$\sigma$-algebra and the uniform measure) and countably many atoms $x_i$, we have that as graphons over
the completion of $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$, both $W$ and $W'$ are isomorphic modulo 0 to a graphon
$U$ over such a space. Proceeding as in the proof of (i), we can then replace $U$ by a Borel graphon $\widetilde{U}$
such that $W$ and $W'$ are isomorphic modulo 0 to the graphon $\widetilde{U}$, which in particular implies that
$W$ and $W'$ are isomorphic modulo 0. □

Proof of Theorem 2.8. If $W$ and $W'$ are isomorphic modulo 0, they are clearly equivalent. Assume
on the other hand that $W$ and $W'$ are equivalent. Moving from $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$ to their
completion, we obtain graphons which are defined on a Lebesgue space and are weakly isomorphic
in the sense of [14]. For bounded graphons, we can then use Theorem 2.1 of [14] to conclude that
$W$ and $W'$ are isomorphic modulo 0 as graphons over the completion of $(\Omega, \mathcal{F}, \pi)$ and $(\Omega', \mathcal{F}', \pi')$.
By Lemma A.2, this implies that they are also isomorphic modulo 0 as graphons over $(\Omega, \mathcal{F}, \pi)$ and
$(\Omega', \mathcal{F}', \pi')$.

If $W$ and $W'$ are unbounded, let $\widetilde{W} = \tanh W$ and $\widetilde{W}' = \tanh W'$. Clearly, $W$ and $W'$ are
equivalent if and only if $\widetilde{W}$ and $\widetilde{W}'$ are equivalent, and $W$ and $W'$ are isomorphic modulo 0 if
and only if $\widetilde{W}$ and $\widetilde{W}'$ are isomorphic modulo 0. Therefore the unbounded case follows from the
bounded case. □

Proof of Theorem 2.9. For bounded graphons, the analogous statement for graphons over a Lebesgue
space was proven in [14]; in particular, by Corollary 3.3 from [14], we can find a twin-free graphon $U$
over a Lebesgue space $(\Omega', \mathcal{F}', \pi')$ and a measure-preserving map $\phi$ from the completion of $(\Omega, \mathcal{F}, \pi)$
to $(\Omega', \mathcal{F}', \pi')$ such that $W = U^\phi$ almost everywhere. By Lemma A.2, this implies the existence of a
twin-free standard Borel graphon $\widetilde{U}$ on a Borel space $(\widetilde\Omega, \widetilde{\mathcal{F}}, \widetilde\pi)$ and a measure-preserving map $\widetilde\phi$ from
$(\Omega, \mathcal{F}, \pi)$ to $(\widetilde\Omega, \widetilde{\mathcal{F}}, \widetilde\pi)$ such that $W = \widetilde{U}^{\widetilde\phi}$ almost everywhere, which proves (ii) for bounded graphons.

Statement (i) follows from (ii) by expanding the atoms $x_i$ in $\widetilde\Omega$ into intervals of widths $p_i = \widetilde\pi(\{x_i\})$.

To reduce the case of unbounded graphons to the case of bounded graphons, we again use the
transformation $W \mapsto \tanh W$, which maps arbitrary graphons to bounded graphons. □

Proof of Theorem 2.11. We first note that the implications (iii) $\Rightarrow$ (ii) $\Rightarrow$ (i) are trivial. So all that
remains to prove is that (i) $\Rightarrow$ (iii), and by Theorem 2.9, it will be enough to prove this for graphons
$W$ and $W'$ over $[0,1]$ equipped with the uniform distribution.

Assume thus that $W$ and $W'$ are graphons over $[0,1]$ with $\delta_\Box(W,W') = 0$. By Lemma A.1
this implies that $W$ and $W'$ can be coupled in such a way that $\|W - W'\|_\Box = 0$, which in turn
implies that $W(x,y) = W'(x',y')$ almost surely with respect to this coupling. As a consequence,
$\delta_\Box(\tanh W, \tanh W') = 0$. By the results of [14], this implies that $\tanh W$ and $\tanh W'$ are equivalent,
which in turn gives that $W$ and $W'$ are equivalent, as required. □


Appendix B. Concentration bounds 


We start with a slight generalization of the multiplicative Chernoff bound. 
Lemma B.1. Let $X_1, \dots, X_n$ be independent random variables with values in $\mathbb{R}$, let $X = \sum_{i=1}^n X_i$,
and suppose there exists $X_0 \in [0,\infty)$ such that

\[ \sum_{i=1}^n \mathbb{E}[X_i^m] \le X_0 \quad\text{for all } m \ge 2. \]

Then

\[ \Pr\bigl(X - \mathbb{E}[X] \ge X_0 t\bigr) \le \exp\Bigl(-\min\{t, t^2\}\frac{X_0}{3}\Bigr) \]

for $t \ge 0$.

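Proof. As in the proof of the standard Chernoff bound, we estimate the expectation of $e^{aX}$ for a
constant $a > 0$ to be determined later. To this end, we first bound

\[ \mathbb{E}[e^{aX_i}] = 1 + a\mathbb{E}[X_i] + \sum_{m \ge 2} \frac{a^m \mathbb{E}[X_i^m]}{m!} \le \exp\Bigl(a\mathbb{E}[X_i] + \sum_{m \ge 2} \frac{a^m \mathbb{E}[X_i^m]}{m!}\Bigr), \]

which together with the assumption of the lemma proves that

\[ \mathbb{E}[e^{aX}] \le \exp\Bigl(a\mathbb{E}[X] + \sum_{m \ge 2} \frac{a^m}{m!} \sum_i \mathbb{E}[X_i^m]\Bigr) \le \exp\bigl(a\mathbb{E}[X] + (e^a - 1 - a)X_0\bigr). \]

As a consequence,

\[ \Pr\bigl(X \ge \mathbb{E}[X] + tX_0\bigr) = \Pr\bigl(e^{a(X - \mathbb{E}[X] - tX_0)} \ge 1\bigr) \le e^{(e^a - 1 - a)X_0 - taX_0}. \]

Choosing $a = \log(1+t)$ gives $e^a - 1 - a - ta = t - (t+1)\log(t+1)$ and hence

\[ \Pr\bigl(X \ge \mathbb{E}[X] + tX_0\bigr) \le e^{-((1+t)\log(1+t) - t)X_0} \le \exp\Bigl(-\min\{t, t^2\}\frac{X_0}{3}\Bigr). \]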
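
The last inequality rests on the elementary bound $(1+t)\log(1+t) - t \ge \min\{t, t^2\}/3$ for $t \ge 0$. The sketch below, with function names of our own choosing, lets one compare the two exponents on a grid and evaluate the resulting tail bound; it is a numerical illustration, not a proof.

```python
import math

def exact_exponent(t):
    """(1 + t) * log(1 + t) - t, the exponent obtained with a = log(1 + t)."""
    return (1.0 + t) * math.log1p(t) - t

def simplified_exponent(t):
    """min{t, t^2} / 3, the exponent appearing in the statement of Lemma B.1."""
    return min(t, t * t) / 3.0

def tail_bound(t, x0):
    """Upper bound on Pr(X - E[X] >= x0 * t) given by Lemma B.1."""
    return math.exp(-simplified_exponent(t) * x0)
```

The two regimes of $\min\{t, t^2\}$ correspond to the usual sub-Gaussian behaviour for small deviations ($t \le 1$) and sub-exponential behaviour for large ones ($t \ge 1$).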


Lemma B.1 immediately implies the following lemma. To state it, we define, for an arbitrary
symmetric matrix $Q \in [0,1]^{n \times n}$ with empty diagonal, the random symmetric matrix $A = \mathrm{Bern}(Q) \in
\{0,1\}^{n \times n}$ obtained by setting $A_{ij} = A_{ji} = 1$ with probability $Q_{ij}$, independently for all $i < j$,
and $A_{ij} = 0$ whenever $i = j$. Note that with this notation, $\mathbb{E}[A_\pi] = Q_\pi$ for all $\pi\colon [n] \to [k]$. The
following lemma states that $A_\pi$ is tightly concentrated around its expectation.
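
To make the definition of $\mathrm{Bern}(Q)$ concrete, here is a minimal sampling sketch. The names `sample_bern` and `density` are ours; matrices are plain nested lists, and `density` computes $\rho(M) = \frac{1}{n^2}\sum_{i,j} M_{ij}$.

```python
import random

def sample_bern(Q, rng):
    """Sample A = Bern(Q): a symmetric 0/1 matrix with empty diagonal.

    Each pair i < j receives an edge independently with probability Q[i][j].
    """
    n = len(Q)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < Q[i][j]:
                A[i][j] = A[j][i] = 1
    return A

def density(M):
    """rho(M) = (1 / n^2) * sum of all entries of M."""
    n = len(M)
    return sum(map(sum, M)) / (n * n)
```

Passing an explicit `random.Random(seed)` instance keeps the sampling reproducible.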

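Lemma B.2. Let $1 \le k \le n$, let $Q$ be a symmetric $n \times n$ matrix with entries in $[0,1]$ and empty
diagonal, and let $A = \mathrm{Bern}(Q)$. Let $\varepsilon$ be the random variable $\varepsilon = \max_{\pi\colon [n] \to [k]} \|A_\pi - Q_\pi\|_1$. Then

\[ (B.1) \qquad \mathbb{E}[\varepsilon] \le 9\sqrt{\rho(Q)\Bigl(\frac{1+\log k}{n} + \frac{k^2}{n^2}\Bigr)}. \]

If $n\rho(Q) \ge 1$, then with probability at least $1 - e^{-n}$,

\[ (B.2) \qquad \varepsilon \le 8\sqrt{\rho(Q)\Bigl(\frac{1+\log k}{n} + \frac{k^2}{n^2}\Bigr)}. \]

Recall that $\rho(Q)$ means $\frac{1}{n^2}\sum_{i,j} Q_{ij}$.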
Proof. We begin with the proof of (B.2). We distinguish two cases:

If $\frac{1+\log k}{n} + \frac{k^2}{n^2} \ge \rho(Q)$, all we need to show is that with probability at least $1 - e^{-n}$, the left side
is at most $8\rho(Q)$. To prove this, we bound

\[ \|A_\pi - Q_\pi\|_1 \le \|A_\pi\|_1 + \|Q_\pi\|_1 = \|A\|_1 + \|Q\|_1. \]

Now we apply Lemma B.1 to the random variable $X = \sum_{i<j} A_{ij}$. Because $\sum_{i<j} \mathbb{E}[A_{ij}^m] = \sum_{i<j} Q_{ij} =
\frac{n^2}{2}\rho(Q)$, we can take $X_0 = \frac{n^2}{2}\rho(Q)$. Taking $t = 6$, we see that with probability at least $1 - e^{-n^2\rho(Q)} \ge
1 - e^{-n}$,

\[ \|A_\pi - Q_\pi\|_1 \le 2\|Q\|_1 + 6\rho(Q) = 8\rho(Q). \]

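If $\frac{1+\log k}{n} + \frac{k^2}{n^2} < \rho(Q)$, we will use a union bound over all $\pi\colon [n] \to [k]$. Considering first a fixed
$\pi\colon [n] \to [k]$, we rewrite

\[
\|A_\pi - Q_\pi\|_1 = \frac{2}{n^2} \sum_{u<v} (Q_{uv} - A_{uv}) \, \mathrm{sign}\bigl((Q_\pi)_{uv} - (A_\pi)_{uv}\bigr)
= \max_{B \in \mathcal{B}_\pi} \frac{2}{n^2} \sum_{u<v} B_{uv}(Q_{uv} - A_{uv}),
\]

where $\mathcal{B}_\pi$ consists of all matrices with entries $\pm 1$ that are constant on the partition classes of $\pi$ (note
that $\mathcal{B}_\pi$ has size $2^{k_0^2}$, where $k_0 \le k$ is the number of non-empty classes in $\pi$). Applying Lemma B.1
again, this time to the random variables $B_{uv}A_{uv}$, noting that $\sum_{u<v} \mathbb{E}[(B_{uv}A_{uv})^m] \le \sum_{u<v} Q_{uv} =
\frac{n^2}{2}\rho(Q)$, and using the union bound to deal with the maximum over $B \in \mathcal{B}_\pi$, we find that

\[ \Pr\bigl(\|A_\pi - Q_\pi\|_1 \ge t\rho(Q)\bigr) \le 2^{k^2} \exp\Bigl(-\min\{t, t^2\}\frac{n^2\rho(Q)}{6}\Bigr). \]

Setting

\[ t = 6\sqrt{\frac{1+\log k}{n\rho(Q)} + \frac{k^2}{n^2\rho(Q)}}, \]

our case assumption implies that $t \le 6$, which in turn implies that

\[ \min\{t, t^2\} \ge \frac{t^2}{6} = 6\Bigl(\frac{1+\log k}{n\rho(Q)} + \frac{k^2}{n^2\rho(Q)}\Bigr). \]

As a consequence, for each partition $\pi\colon [n] \to [k]$,

\[ \Pr\bigl(\|A_\pi - Q_\pi\|_1 \ge t\rho(Q)\bigr) \le \exp\bigl(k^2\log 2 - n(1+\log k) - k^2\bigr) \le e^{-n(1+\log k)}. \]

Taking the union bound over all partitions $\pi\colon [n] \to [k]$ (there are at most $k^n = e^{n\log k}$ of them), we obtain the desired bound.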

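All that remains is to prove (B.1). If $n\rho(Q) < 1$, we bound

\[ \mathbb{E}[\varepsilon] \le \mathbb{E}[\|A - Q\|_1] \le \|Q\|_1 + \mathbb{E}[\|A\|_1] = 2\rho(Q) \le 2\sqrt{\frac{\rho(Q)}{n}}. \]

If $n\rho(Q) \ge 1$, we use (B.2) and the fact that $\varepsilon \le \|Q\|_1 + \|A\|_1 \le 2$ to bound

\[ \mathbb{E}[\varepsilon] \le 8\sqrt{\rho(Q)\Bigl(\frac{1+\log k}{n} + \frac{k^2}{n^2}\Bigr)} + 2e^{-n}. \]

Because $2e^{-n} \le 1/n \le \sqrt{n\rho(Q)}/n = \sqrt{\rho(Q)/n}$, this completes the proof. □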


Our next lemma states that a similar bound holds for the cut norm of A — Q. 
Lemma B.3. Let $n \ge 2$, let $Q$ be a symmetric $n \times n$ matrix with entries in $[0,1]$ and empty diagonal,
and let $A = \mathrm{Bern}(Q)$. Then

\[ \mathbb{E}\bigl[\|A - Q\|_\Box\bigr] \le 16\sqrt{\frac{\rho(Q)}{n}}. \]

If $n\rho(Q) \ge 1$, then with probability at least $1 - e^{-n}$,

\[ (B.3) \qquad \|A - Q\|_\Box \le 15\sqrt{\frac{\rho(Q)}{n}}. \]

Proof. A bound of the form (B.3) can easily be inferred from Lemma 7.2 in [15]. For the convenience
of the reader, we give an independent, slightly simpler proof here.

Let $\mathcal{F}_n$ be the set of functions $f\colon [n] \to \{-1, +1\}$. It is not hard to check that

\[
\|A - Q\|_\Box \le \max_{f,g \in \mathcal{F}_n} \frac{1}{n^2} \sum_{i,j} f(i)g(j)(A_{ij} - Q_{ij})
\le \max_{f,g \in \mathcal{F}_n} \frac{2}{n^2} \sum_{i<j} f(i)g(j)(A_{ij} - Q_{ij}).
\]

Proceeding as in the proof of Lemma B.2, a union bound and Lemma B.1 now imply that

\[ \Pr\bigl(\|A - Q\|_\Box \ge t\rho(Q)\bigr) \le 4^n \exp\Bigl(-\min\{t, t^2\}\frac{n^2\rho(Q)}{6}\Bigr). \]

Choosing $t = 6\log(4e)/\sqrt{n\rho(Q)}$ and observing that $6\log(4e) \le 15$ then gives the high probability
bound. The bound in expectation follows from this bound and the observation that || A—Qjjn < 2p{Q). 
Indeed, if np(Q) > 1, then 

l5y^p{Q)/n + 2e~^p{Q) < 15y^p{Q)/n + 2p{Q)l (en) 

< p{Q)/n 

(for the final step recall that p{Q) < 1), and if np{Q) < 1, then 2p{Q) < 2\Jp{Q)ln. □ 

Appendix C. Proofs of Lemmas 2.19, 2.20, and 2.21 

We start with the following lemma, which is an easy consequence of the law of large numbers for
$U$-statistics.

Lemma C.1. Let $(\Omega, \mathcal{F}, \pi)$ be a probability space, and let $W\colon \Omega \times \Omega \to \mathbb{R}$ be in $L^p$ for some $p \ge 1$.
Then $\|H_n(W)\|_p \to \|W\|_p$ a.s.

Proof. Define $U = |W|^p$, and choose $x_1, \dots, x_n$ i.i.d. with distribution $\pi$. Then

\[ \|H_n(W)\|_p^p = \frac{1}{n^2} \sum_{i \ne j} |W(x_i, x_j)|^p = \frac{1}{n^2} \sum_{i \ne j} U(x_i, x_j). \]

By the strong law of large numbers for $U$-statistics (see, for example, [40]), the right side converges
to $\|U\|_1 = \|W\|_p^p$, as claimed. □

Next we prove Lemma 2.20. 

Proof of Lemma 2.20. We first note that the statement clearly holds if $W$ is replaced by the block
model $W^{(k)} = W_{\mathcal{P}_k}$, where $\mathcal{P}_k$ is the partition of $[0,1]$ into consecutive intervals of length $1/k$. To
see this, one just needs to use the fact that as $n \to \infty$, the fraction of points $x_i$ which fall into the
$i$th interval converges a.s. to $1/k$.

To prove the statement of the lemma for general $W$, we will use Lemma C.1. Let $\rho = \rho_n$, fix
$\varepsilon > 0$, choose $k$ so that $\|W - W^{(k)}\|_p \le \varepsilon$, and let $M$ be large enough that $\|W 1_{W \ge M}\|_p \le \varepsilon$. Also,
define $W_\rho = \min\{W, 1/\rho\}$. Noting that $\frac{1}{\rho}Q_n = H_n(W_\rho)$, we then bound

\[
\begin{aligned}
\delta_p\bigl(W, \tfrac{1}{\rho}Q_n\bigr) = \delta_p\bigl(W, H_n(W_\rho)\bigr)
&\le \|W - W^{(k)}\|_p + \delta_p\bigl(W^{(k)}, H_n(W^{(k)})\bigr) \\
&\quad + \|H_n(W^{(k)}) - H_n(W)\|_p + \|H_n(W) - H_n(W_\rho)\|_p.
\end{aligned}
\]

Assuming $n$ is large enough to ensure that $\rho^{-1} \ge M$ (which in turn implies that $|W - W_\rho| =
W - W_\rho \le W 1_{W \ge M}$), we then bound the right side by

\[ \varepsilon + \delta_p\bigl(W^{(k)}, H_n(W^{(k)})\bigr) + \|H_n(W^{(k)} - W)\|_p + \|H_n(W 1_{W \ge M})\|_p. \]

As $n \to \infty$, the second term goes to zero with probability one, and the third and the fourth both
converge to quantities which are at most $\varepsilon$ by Lemma C.1. Thus, with probability one, the limit
superior of $\delta_p(W, \frac{1}{\rho}Q_n)$ is at most $3\varepsilon$. Since $\varepsilon$ was arbitrary, this proves the claim. □

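Next we prove Lemma 2.19. To this end, we start with a simple technical lemma. We use $\lambda$ to
denote the Lebesgue measure on $[0,1]$ or $[0,1]^2$ (depending on the context), and, as usual, we use
$A \triangle B$ to denote the symmetric difference of two sets $A, B$, i.e., $A \triangle B = (A \setminus B) \cup (B \setminus A)$.

Lemma C.2. Let $W$ and $W'$ be of the form $W = \sum_{i,j} B_{ij} 1_{Y_i \times Y_j}$ and $W' = \sum_{i,j} B_{ij} 1_{Y_i' \times Y_j'}$, where
$B$ is a $k \times k$ matrix, and $(Y_1, \dots, Y_k)$, $(Y_1', \dots, Y_k')$ are partitions of $[0,1]$. If $\lambda(Y_i \triangle Y_i') \le \varepsilon\lambda(Y_i)$
for all $i$, then

\[ \|W - W'\|_p \le \sqrt[p]{2\varepsilon(1+\varepsilon)} \, \|W\|_p. \]

Proof. We begin with the bound

\[
\|W - W'\|_p^p = \Bigl\|\sum_{i,j} B_{ij}\bigl(1_{Y_i \times Y_j} - 1_{Y_i' \times Y_j'}\bigr)\Bigr\|_p^p
\le \sum_{i,j} |B_{ij}|^p \, \lambda\bigl((Y_i \times Y_j) \triangle (Y_i' \times Y_j')\bigr).
\]

We have

\[ (Y_i \times Y_j) \triangle (Y_i' \times Y_j') \subseteq \bigl((Y_i \cup Y_i') \times (Y_j \triangle Y_j')\bigr) \cup \bigl((Y_i \triangle Y_i') \times (Y_j \cup Y_j')\bigr). \]

Combining this containment with $\lambda(Y_i \triangle Y_i') \le \varepsilon\lambda(Y_i)$ and $\lambda(Y_i \cup Y_i') \le (1+\varepsilon)\lambda(Y_i)$ yields

\[ \|W - W'\|_p^p \le 2\varepsilon(1+\varepsilon) \sum_{i,j} |B_{ij}|^p \lambda(Y_i)\lambda(Y_j) = 2\varepsilon(1+\varepsilon)\|W\|_p^p, \]

as desired. □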
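
The set containment used in the proof is purely combinatorial, so it can be checked on finite sets as a discrete stand-in for measurable subsets of $[0,1]$. The helper names below are ours, and the finite setting is an assumption made for illustration only.

```python
from itertools import product

def sym_diff_of_products(A, B, A2, B2):
    """(A x B) symmetric-difference (A2 x B2), as a set of pairs."""
    return set(product(A, B)) ^ set(product(A2, B2))

def covering_set(A, B, A2, B2):
    """((A u A2) x (B sym-diff B2)) u ((A sym-diff A2) x (B u B2))."""
    return set(product(A | A2, B ^ B2)) | set(product(A ^ A2, B | B2))
```

For any finite sets, `sym_diff_of_products(A, B, A2, B2) <= covering_set(A, B, A2, B2)` holds: a pair lying in exactly one of the two products must differ from the other product in its first or its second coordinate.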


Remark C.3. A slight variation of the above proof also shows that 

no matter how large the measure of the symmetric differences $Y_i \triangle Y_i'$ is. To see this, just bound

\[ \|W - W'\|_p^p \le \sum_{i,j} |B_{ij}|^p \max\{\lambda(Y_i \times Y_j), \lambda(Y_i' \times Y_j')\}. \]
Proof of Lemma 2.19. If k = 1, — 'S>k(II^) and there is nothing to prove. We may

therefore assume without loss of generality that $\kappa \in (0,1)$.

To prove the first bound, we write $W'$ as $(p, B)$ and reorder the elements of $[k]$ so that $p_1 \le
p_2 \le \dots \le p_k$. Also, without loss of generality, we may remove all labels with $p_i = 0$, so that
$p_i \ge \kappa$ for all $i$. Define $W'' = (p'', B)$, where $p''$ is obtained from $p$ so that for each $i$, $p_1'' + \dots + p_i''$
equals $p_1 + \dots + p_i$ rounded to the nearest multiple of $1/n$ (with the convention that in the case
of ties, we choose the point to the left). After embedding both $W'$ and $W''$ into the space of
graphons on $[0,1]$, we can write the resulting graphons $\mathcal{W}[W'']$ and $\mathcal{W}[W']$ in the form
$\mathcal{W}[W''] = \sum_{i,j} B_{ij} 1_{Y_i'' \times Y_j''}$ and $\mathcal{W}[W'] = \sum_{i,j} B_{ij} 1_{Y_i' \times Y_j'}$, where $Y_i'$ and $Y_i''$ are intervals whose endpoints
differ by at most $1/(2n)$. As a consequence $\lambda(Y_i' \triangle Y_i'') \le \frac{1}{n} \le \frac{1}{\kappa n}\lambda(Y_i')$. By Lemma C.2 and the
fact that $\frac{1}{\kappa n} \le 1$, this implies that

\[ (C.1) \qquad \bigl\|\mathcal{W}[W'] - \mathcal{W}[W'']\bigr\|_p \le \sqrt[p]{\frac{4}{\kappa n}} \, \|W'\|_p. \]

To complete the proof of the first bound, all we need to show is that $W'' \in \mathcal{B}_{n,\ge\kappa}$, which means
we need to show that $np_i'' = n\lambda(Y_i'') \ge \lfloor n\kappa \rfloor$. Let $i_0$ be the first $i$ such that $np_i$ is not an
integer. For $i < i_0$, we then have $np_i'' = np_i \ge n\kappa \ge \lfloor n\kappa \rfloor$. On the other hand, for $i \ge i_0$, we can use
$|np_i - np_i''| \le 1$, which follows from $|n(p_1 + \dots + p_i) - n(p_1'' + \dots + p_i'')| \le 1/2$. We then conclude
that $np_i'' \ge np_i - 1 \ge np_{i_0} - 1 > \lfloor np_{i_0} \rfloor - 1 \ge \lfloor \kappa n \rfloor - 1$, where we used that $np_{i_0}$ is not an integer.
Since $np_i''$ is an integer, this implies $np_i'' \ge \lfloor n\kappa \rfloor$, which shows that $W'' \in \mathcal{B}_{n,\ge\kappa}$. Identifying $W''$
with the corresponding matrix in $\mathcal{A}_{n,\ge\kappa}$, this proves the first bound.

To prove the second bound we first observe that the minimizer $W'' = (p'', B) \in \mathcal{B}_{n,\ge\kappa}$ obeys the
bound $\|W''\|_p \le 2\|W\|_p$. Our task is now to find a block model $W' \in \mathcal{B}_{\ge\kappa}$ that approximates $W''$
in the norm $\delta_p$. Let $k''$ be the number of classes in $W''$; again, we assume without loss of generality
that they are all non-empty, which means we have that $p_i'' \ge \kappa_n$ for all $i \in [k'']$, where $\kappa_n := \frac{1}{n}\lfloor n\kappa \rfloor$.

We would like to increase $p_i''$ to $\kappa$ whenever it is smaller than that, while compensating for this
by decreasing those probabilities that are larger than $\kappa$. However, there is a potential obstruction,
namely that $k''\kappa$ could be greater than 1, in which case it is clearly impossible to increase all $k''$
probabilities to at least $\kappa$. For comparison, we know that $k''\kappa_n \le 1$, but that is a slightly weaker
assertion.

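To deal with this difficulty, we will show that there exists some $n_0$ depending on $\kappa$ such that for
$n \ge n_0$, we do have $\kappa k'' \le 1$. First, note that $\kappa_n \ge \kappa - \frac{1}{n}$. Thus, since $k''$ is an integer with $k''\kappa_n \le 1$,

\[ k'' \le \Bigl\lfloor \frac{1}{\kappa - 1/n} \Bigr\rfloor. \]

As $n \to \infty$, $1/(\kappa - 1/n)$ approaches $1/\kappa$ from above, and thus

\[ \Bigl\lfloor \frac{1}{\kappa - 1/n} \Bigr\rfloor = \Bigl\lfloor \frac{1}{\kappa} \Bigr\rfloor \]

for all sufficiently large $n$. If we take $n_0$ to be sufficiently large, then for $n \ge n_0$ we have

\[ k''\kappa \le \Bigl\lfloor \frac{1}{\kappa} \Bigr\rfloor \kappa \le 1. \]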
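
The floor identity used here holds only once $n$ is large enough, and the threshold can be computed exactly for rational $\kappa$. The sketch below uses exact rational arithmetic; the function names `max_classes` and `smallest_good_n` and the search cap `n_max` are our own and serve only as an illustration.

```python
from fractions import Fraction
from math import floor

def max_classes(kappa, n):
    """floor(1 / (kappa - 1/n)): the largest integer k'' with k'' <= 1 / (kappa - 1/n)."""
    return floor(1 / (kappa - Fraction(1, n)))

def smallest_good_n(kappa, n_max=10**6):
    """Smallest n with floor(1 / (kappa - 1/n)) == floor(1 / kappa).

    Since 1 / (kappa - 1/n) decreases towards 1 / kappa as n grows, the
    identity then holds for every larger n as well.
    """
    target = floor(1 / kappa)
    n = target + 1  # guarantees kappa - 1/n > 0
    while n <= n_max:
        if max_classes(kappa, n) == target:
            return n
        n += 1
    raise ValueError("no suitable n found below n_max")
```

For instance, with $\kappa = 1/4$ the identity fails at $n = 20$, where $1/(\kappa - 1/n) = 5$, and holds from $n = 21$ onwards.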


Given this, we now define $W' = (p, B)$ as follows: let $I_-$ be the set of indices $i \in [k'']$ such that
$p_i'' < \kappa$, and let $\delta = \sum_{i \in I_-} (\kappa - p_i'')$. For $i \in I_-$, we then set $p_i = \kappa$, while for $i \notin I_-$ we first decrease
the largest $p_i''$ until we either hit $\kappa$ or have used up the excess $\delta$. If we stop because we hit $\kappa$,
then we move to the next largest $p_i''$, etc. Since in the second step, we will eventually use up the
excess $\delta$, this process constructs a distribution $p$ such that $p_i \ge \kappa$ for all $i \in [k'']$, and such that
$\sum_i |p_i - p_i''| = 2\delta$. Note for future reference that $\delta \le k''/n$.
Writing the embedding $\mathcal{W}[W'']$ of $W''$ into the set of graphons over $[0,1]$ as $\sum_{i,j} B_{ij} 1_{Y_i'' \times Y_j''}$, we
construct corresponding measurable sets $Y_i$ such that $Y_1, \dots, Y_{k''}$ forms a partition of $[0,1]$ with
$\lambda(Y_i) = p_i$ and $\lambda(Y_i \triangle Y_i'') \le |p_i - p_i''|$. (Each set $Y_i$ will be either a superset or a subset of $Y_i''$,
according to whether $p_i''$ was increased or decreased.)

For $i \in I_-$,

\[ \lambda(Y_i \triangle Y_i'') \le |p_i - p_i''| \le \frac{1}{n} \le \frac{1}{\kappa_n n}\lambda(Y_i''). \]

For $i \notin I_-$,

\[ \lambda(Y_i \triangle Y_i'') \le |p_i - p_i''| \le \delta \le \frac{k''}{n} \le \frac{1}{\kappa_n^2 n}\lambda(Y_i''). \]

When n is sufficiently large, Hn > Increase no enough for this to hold, as well as no > 1/n^. 
Then for n > uq, 

VW',W'') < f;£\\W'X < 

by Lemma C.2, as in the proof of the first bound. This concludes the proof of the second bound. □ 


For bounded graphons, the next lemma was proved in [20]. 

Lemma C.4. Let $\mathcal{P} = (Y_1, \dots, Y_k)$ be a partition of $[0,1]$ into consecutive intervals, and let $W$ be
a graphon over $[0,1]$ that is constant on sets of the form $Y_i \times Y_j$. If $x_1, \dots, x_n$ are chosen i.i.d.
uniformly at random from $[0,1]$ and $H_n$ is the $n \times n$ matrix with entries $W(x_i, x_j)$, then

\[ \delta_p(H_n, W) \le \sqrt[p]{2\varepsilon(1+\varepsilon)} \, \|W\|_p, \]

where $\varepsilon$ is the random variable

\[ \varepsilon = \max_i \frac{1}{\lambda(Y_i)}\Bigl(\Bigl|\frac{n_i}{n} - \lambda(Y_i)\Bigr| + \frac{1}{n}\Bigr), \]

with $n_i$ denoting the number of points $x_\ell$ that lie in $Y_i$.


Proof. Let $I_1, \dots, I_n$ be a partition of $[0,1]$ into adjacent intervals of lengths $1/n$. Then $W[H_n]$ is of
the form $\sum_{i,j} B_{ij} 1_{Y'_i \times Y'_j}$, where $Y'_i$ is the union of $n_i$ of the intervals $I_1, \dots, I_n$ (which particular $n_i$
intervals depends on the labeling of the vertices of $H_n$). In fact, given a map $\pi \colon [n] \to [k]$, define
$Y'_i = Y'_i(\pi)$ to be the union of all intervals $I_\ell$ such that $\pi(\ell) = i$, and let $W(\pi) = \sum_{i,j} B_{ij} 1_{Y'_i(\pi) \times Y'_j(\pi)}$.
Then

\[ \delta_p(H_n, W) = \min_\pi \|W(\pi) - W\|_p, \]

where the minimum is over all $\pi$ such that $|\pi^{-1}(\{i\})| = n_i$ for all $i$. In view of Lemma C.2, we will
want to keep the Lebesgue measure of $Y_i \,\triangle\, Y'_i$ small for all $i$. We claim that this is indeed possible,
and that $\pi$ can be chosen in such a way that

\[ \lambda(Y_i \,\triangle\, Y'_i) \le \left| \frac{n_i}{n} - \lambda(Y_i) \right| + \frac{1}{n} \quad \text{for all } i. \tag{C.2} \]

To prove this, we note that choosing $\pi$ is equivalent to choosing, for all $i$, $n_i$ of the intervals $I_1, \dots, I_n$
to make up $Y'_i$.

Let $\widetilde{Y}_1, \dots, \widetilde{Y}_k$ be obtained from $Y_1, \dots, Y_k$ by rounding the endpoints to the nearest integer
multiples of $1/n$, choosing the multiple to the left in case of a tie. With this convention,

\[ \lfloor \lambda(Y_i) n \rfloor \le \lambda(\widetilde{Y}_i) n \le \lceil \lambda(Y_i) n \rceil. \]

Thus, if $n_i \le \lambda(Y_i) n$, then $n_i \le n \lambda(\widetilde{Y}_i)$, while if $n_i > \lambda(Y_i) n$, then $n_i \ge n \lambda(\widetilde{Y}_i)$. Keeping this in
mind, we see that for $n_i \le \lambda(Y_i) n$, we can find at least $n_i$ intervals $I_\ell$ that, except possibly for their
endpoints, are subsets of $\widetilde{Y}_i$. We will define $Y'_i$ to be the union of $n_i$ of these intervals. In a similar way,
if $n_i > \lambda(Y_i) n$, we choose $n \lambda(\widetilde{Y}_i) \le n_i$ intervals (namely, those forming $\widetilde{Y}_i$) to build a preliminary
set $Y'_i$. Having done this for all $i$, we take a second run through all $i$ with $n_i > \lambda(\widetilde{Y}_i) n$, choosing
an arbitrary set of $n_i - \lambda(\widetilde{Y}_i) n$ intervals from those not yet assigned at this point. At the end of
this round, we end up with sets $Y'_i$ such that $Y'_i$ is the union of $n_i$ intervals from $I_1, \dots, I_n$, with
the additional property that

\[ \text{either } Y'_i \subseteq \widetilde{Y}_i \text{ or } \widetilde{Y}_i \subseteq Y'_i. \]

But this implies that $\lambda(Y'_i \,\triangle\, \widetilde{Y}_i) = \left| \frac{n_i}{n} - \lambda(\widetilde{Y}_i) \right|$ for all $i$. Since the endpoints of $Y_i$ get shifted by
at most $1/(2n)$ in order to obtain $\widetilde{Y}_i$, the additional error in going from $\widetilde{Y}_i$ to $Y_i$ is at most $1/n$,
proving (C.2). Combined with Lemma C.2, this concludes the proof. □
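The rounding step in the proof above can be checked numerically. The following is a minimal sketch, with hypothetical helper names `round_left` and `rounded_lengths`, that rounds the endpoints of a partition into consecutive intervals to multiples of $1/n$ (ties to the left) and returns the resulting interval lengths in units of $1/n$; exact rationals are used so that ties are detected reliably.

```python
from fractions import Fraction
import math

def round_left(x, n):
    """Nearest integer multiple of 1/n to x, taking the left multiple on a tie."""
    return Fraction(math.ceil(n * x - Fraction(1, 2)), n)

def rounded_lengths(endpoints, n):
    """endpoints: 0 = a_0 < a_1 < ... < a_k = 1, given as Fractions.
    Returns n times the length of each rounded interval (always an integer)."""
    r = [round_left(a, n) for a in endpoints]
    return [int(n * (b - a)) for a, b in zip(r, r[1:])]

n = 10
endpoints = [Fraction(0), Fraction(1, 3), Fraction(3, 4), Fraction(1)]
rounded = rounded_lengths(endpoints, n)
```

Here the endpoint $3/4$ is a tie between $7/10$ and $8/10$ and is moved to the left, so the rounded intervals consist of 3, 4, and 3 of the intervals $I_\ell$, each count lying between the floor and the ceiling of $n$ times the original length.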

Finally, the following lemma implies Lemma 2.21. 

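Lemma C.5. Let $\varepsilon$ and the other notation be as in the previous lemma, suppose that all classes of $\mathcal{Y}$
have measure at least $\kappa$, and let $\eta \in (0,1)$. Then

\[ \varepsilon \le \frac{1}{\kappa n} + \max\left\{ \frac{3}{n\kappa} \log\frac{2}{\kappa\eta},\ \sqrt{\frac{3}{n\kappa} \log\frac{2}{\kappa\eta}} \right\} \]

with probability at least $1 - \eta$. As a consequence, if $C$ is a positive real number, then

\[ \delta_p(H_n, W) = O_p\!\left( \sqrt[2p]{\frac{\log n}{n\kappa}} \right) \|W\|_p \]

whenever $\frac{\log n}{n\kappa} \le C$, with the constant implicit in the $O_p$ symbol depending on $C$. In addition, if
$\kappa = \kappa_n$ is such that $\limsup \frac{1}{n\kappa_n} \log n < C$, then with probability one, there exists a random $n_0$ such
that for $n \ge n_0$,

\[ \delta_p(H_n, W) = O\!\left( \sqrt[2p]{\frac{\log n}{n\kappa}} \right) \|W\|_p, \]

with the constant implicit in the big-O symbol again depending on $C$.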


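Proof. By the multiplicative Chernoff bound,

\[ \Pr\left( \left| \frac{n_i}{n} - \lambda(Y_i) \right| \ge t\,\lambda(Y_i) \right) \le 2 \exp\left( -\frac{n\lambda(Y_i)}{3} \min\{t, t^2\} \right) \le 2 \exp\left( -\frac{n\kappa}{3} \min\{t, t^2\} \right), \]

so by the union bound and the fact that the number $k$ of classes is at most $1/\kappa$, we get

\[ \varepsilon \le t + \frac{1}{\kappa n} \quad \text{with probability at least } 1 - \frac{2}{\kappa} \exp\left( -\frac{n\kappa}{3} \min\{t, t^2\} \right). \]

Setting $y = \frac{3}{n\kappa} \log\frac{2}{\kappa\eta}$, we see that with probability at least $1 - \eta$, $\varepsilon \le t + \frac{1}{\kappa n}$ whenever $\min\{t, t^2\} \ge y$.
Taking $t = \max\{y, \sqrt{y}\}$, this implies the bound on $\varepsilon$.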
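As a sanity check of the bound just derived, one can simulate $\varepsilon$ directly. The sketch below is illustrative only: the helper names `epsilon` and `bound` are hypothetical, and it assumes $\varepsilon = \max_i \lambda(Y_i)^{-1}\bigl(|n_i/n - \lambda(Y_i)| + 1/n\bigr)$ and the threshold $\frac{1}{\kappa n} + \max\{y, \sqrt{y}\}$ with $y = \frac{3}{n\kappa}\log\frac{2}{\kappa\eta}$.

```python
import math
import random
from bisect import bisect_right

def epsilon(sample, cuts):
    """epsilon for the partition of [0,1) with interior breakpoints `cuts` (sorted list)."""
    n = len(sample)
    edges = [0.0] + list(cuts) + [1.0]
    counts = [0] * (len(edges) - 1)
    for x in sample:
        counts[bisect_right(cuts, x)] += 1      # index of the interval containing x
    return max((abs(c / n - (b - a)) + 1 / n) / (b - a)
               for c, a, b in zip(counts, edges, edges[1:]))

def bound(n, kappa, eta):
    """Threshold that epsilon should exceed with probability at most eta."""
    y = 3 / (n * kappa) * math.log(2 / (kappa * eta))
    return 1 / (kappa * n) + max(y, math.sqrt(y))

rng = random.Random(0)                          # fixed seed for reproducibility
cuts, kappa, eta, n, trials = [0.25, 0.5, 0.75], 0.25, 0.1, 2000, 200
failures = sum(
    epsilon([rng.random() for _ in range(n)], cuts) > bound(n, kappa, eta)
    for _ in range(trials)
)
failure_rate = failures / trials
```

Because the Chernoff and union bounds are not tight, the observed `failure_rate` is expected to be well below $\eta$ in experiments of this kind.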


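For the remaining part of the proof, choose $\eta = 2n^{-2}$. Then with probability at least $1 - 2n^{-2}$,

\begin{align*}
\varepsilon &\le \frac{1}{\kappa n} + \max\left\{ \frac{3}{n\kappa} \log\frac{n^2}{\kappa},\ \sqrt{\frac{3}{n\kappa} \log\frac{n^2}{\kappa}} \right\} \\
&\le \frac{1}{\kappa n} + \max\left\{ \frac{9}{n\kappa} \log 2Cn,\ \sqrt{\frac{9}{n\kappa} \log 2Cn} \right\} \qquad \left( \text{because } \frac{1}{n\kappa} \le \frac{C}{\log n} \le \frac{C}{\log 2} \le 2C \right) \\
&\le C' \sqrt{\frac{\log n}{n\kappa}}
\end{align*}

for some $C'$ depending on $C$. This implies $2\varepsilon(1 + \varepsilon) \le 2(1 + \sqrt{C}\,C')\,C' \sqrt{\frac{\log n}{n\kappa}} =: C'' \sqrt{\frac{\log n}{n\kappa}}$ and
hence

\[ \delta_p(H_n, W) \le \sqrt[p]{C'' \sqrt{\frac{\log n}{n\kappa}}}\ \|W\|_p. \]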

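Since the failure probability $2n^{-2}$ is summable, this implies the a.s. statement. To prove the
statement in probability, we note that Remark C.3 gives a deterministic bound on $\delta_p(H_n, W)$, which shows that

\[ \mathbb{E}\left[ \bigl( \delta_p(H_n, W) \bigr)^p \right] \le \left( C'' \sqrt{\frac{\log n}{n\kappa}} + \frac{\eta}{\kappa^2} \right) \|W\|_p^p = \left( C'' \sqrt{\frac{\log n}{n\kappa}} + \frac{2}{\kappa^2 n^2} \right) \|W\|_p^p. \]

This implies the statement in probability. □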


Microsoft Research, One Memorial Drive, Cambridge, MA 02142 
E-mail address: borgs@microsoft.com 

Microsoft Research, One Memorial Drive, Cambridge, MA 02142 
E-mail address: jchayes@microsoft.com 

Microsoft Research, One Memorial Drive, Cambridge, MA 02142 
E-mail address: cohn@microsoft.com 

Department of Mathematics, University of Washington, Seattle, WA 98195 
E-mail address: sganguly@math.washington.edu 



